Preface


The purpose of this book is 


(a): to provide the elements of probability and stochastic processes of direct 
interest to the applied sciences where probabilistic models play an important role, 
most notably the information and communications sciences, the computer sciences, 
operations research and electrical engineering, but also epidemiology, biology,
ecology, physics and the earth sciences,


(b): to introduce very progressively the basic notions of probability and to help 
the reader to acquire the computational skills necessary for the manipulation of 
random variables and vectors (the elementary “calculus of probability”), and 


(c): to give the essentials of the mathematical theory that will bring the reader 
to the level that is indispensable for a profitable application of probability in the 
fields mentioned above. 


The treatment is mathematical yet not unnecessarily abstract. It maintains the 
balance between depth and width that is adequate for the efficient manipulation, 
based on solid theoretical foundations, of the most popular probabilistic models. 
The theoretical tools are presented gradually in such a way as to not deter the 
reader with a wall of technicalities before having the opportunity to understand 
their relevance in simple situations. In particular, the use of the so-called modern 
integration theory (that is, the Lebesgue integral) is postponed until the fifth 
chapter, where it is reviewed in sufficient detail for a rigorous treatment of the 
topics of interest in the various domains of application listed above. All the results 
are proved, except in the rare situations where the tools needed for the proof require 
a deeper immersion into the foundations of measure and integration theory and 
only when their content is intuitive. They are then accompanied by meaningful 
examples of application. 


The contents are organized in three parts. 


Part I: The Elementary Calculus 


In this part (Chapters 1 to 3), the beginner is acquainted with the vocabulary of 
probability theory and with the methods and tricks of the trade that suffice to treat 
simple, yet significant, examples. The first two chapters are devoted to discrete 
probability models and the third chapter to continuous random variables and
vectors. This part features, among other topics, generating functions, Gaussian
vectors, linear regression and the elementary theory of conditional expectation.


The focus there is on practical computations and only a working knowledge of
series, of the Riemann integral and of matrices is required. 


Part II: The Essential Theory 


The book then proceeds to the basic theory of probability, starting with a brief
survey of integration theory, which is then used to revisit, formalize and generalize
the results of the previous chapters that were either admitted or proved in the special
framework of discrete probability and continuous random vectors. It introduces
the various types of convergence of sequences of random variables: almost-sure,
in probability, in distribution, in variation and in the quadratic mean, featuring
in particular the strong law of large numbers and the central limit theorem, and
gives the intermediate, and then the advanced, theory of conditional expectation.
Chapter 8 is an introduction to martingales, one of the fundamental tools of
probability. It may be considered to be a continuation of the theme “Convergence of
sequences”, especially of the chapter on almost-sure convergence.


The first and second parts provide the material for a basic course in the theory
of probability.


Part III: The Important Models


The results gathered at this point are then applied to the four most important and
ubiquitous categories of probabilistic models:

• Markov chains, an omnipresent and most versatile model of applied probability,

• Poisson processes (on the line and in space), which occur in a number of
applications, ranging from ecology to queuing and mobile communications
networks,

• Brownian motion, which models fluctuations of the stock market and the
“white noise” of physics, and

• Wide-sense stationary processes, which are of special importance in signal
analysis and design, and also in the earth sciences.


An appendix on Hilbert spaces is given for easy reference and self-containedness. 


Each chapter contains a final section with exercises. In the important transition 
chapters 4 and 5, the solutions are given. 
This book can be used as a text in a variety of ways and at various levels of
study. Essentially, it provides the material for a two-semester graduate course on 
probability and stochastic processes in a department of applied mathematics, or 
for students in departments where stochastic models play an essential role. 


The progressive introduction of the concepts and of the tools, together with the
inclusion of numerous examples, also makes this book well suited to self-study.


Paris, October 15, 2023 


Pierre Brémaud 
Chapter 1


Basic Notions 


Probability theory aims at quantifying randomness. It concerns “experiments” 
(performed by man or Nature, or both) whose outcome is uncertain, and evaluates 
the probability of the resulting events. The meaning of these terms (outcomes, 
events and probability) is given in the so-called axiomatic framework embodied
in the trinity $(\Omega, \mathcal{F}, P)$, called the probability space, that will be progressively
introduced in this chapter. 


1.1 Outcomes and Events 


We first recall the notation concerning the basic set operations: union, intersection, 
and complementation. 


If $A$ and $B$ are subsets of some set $\Omega$, $A \cup B$ denotes their union and $A \cap B$ their
intersection. In this book, $\overline{A}$ denotes the complement of $A$ in $\Omega$. The notation
$A + B$ (the sum of $A$ and $B$) implies by convention that $A$ and $B$ are disjoint,
in which case it stands for the union $A \cup B$. Similarly, the notation $\sum_{k=1}^{\infty} A_k$ is
used for $\cup_{k=1}^{\infty} A_k$ only when the $A_k$'s are pairwise disjoint. The notation $A - B$
is used only if $B \subseteq A$, and it stands for $A \cap \overline{B}$. In particular, if $B \subseteq A$, then
$A = B + (A - B)$.


A subset of $\Omega$ consisting of just one element $a \in \Omega$ is called a singleton and is
denoted by $\{a\}$. Similar notation is used for sets with a finite number of elements.
For instance $\{a, b, c\}$ represents the set consisting of the three distinct elements $a$,
$b$ and $c$ in $\Omega$.


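The indicator function of the subset $A \subseteq \Omega$ is the function $1_A : \Omega \to \{0, 1\}$
defined by

$$1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A. \end{cases}$$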
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
Let now $P$ be a property that an element of some set $X$ may or may not possess.
The notations

$1_P(x)$ or $1_{\{x \text{ satisfies } P\}}$

stand for $f(x)$, where $f(x) = 1$ if $x$ satisfies property $P$, and $f(x) = 0$ otherwise. A variety
of similar notations will be used and should be self-explanatory in a given context.
For instance, $f$ and $g$ being real-valued functions defined on a set $X$, $1_{\{f > g\}}(x)$ is
equal to 1 if $f(x) > g(x)$, and to 0 otherwise.


Random phenomena are observed by means of experiments. Each experiment
results in an outcome. The collection of all possible outcomes $\omega$ is called the sample
space $\Omega$. Any subset $A$ of the sample space $\Omega$ can be regarded as a representation
of some event.¹


EXAMPLE 1.1.1: TOSSING A DIE, TAKE 1. The experiment consists in tossing
a die once. The possible outcomes are $\omega = 1, 2, \ldots, 6$ and the sample space is the
set $\Omega = \{1, 2, 3, 4, 5, 6\}$. The subset $A = \{1, 3, 5\}$ is the event “result is odd.”


EXAMPLE 1.1.2: THROWING A DART. The experiment consists in throwing a
dart at a wall. The sample space can be chosen to be the plane $\mathbb{R}^2$. An outcome
is the position $\omega = (x, y)$ hit by the dart. The subset $A = \{(x, y);\ x^2 + y^2 > 1\}$ is
an event that could be named “you missed the dartboard” (the disk of radius 1
centered at 0).


EXAMPLE 1.1.3: HEADS OR TAILS, TAKE 1. The experiment is an infinite
succession of coin tosses. One can take for the sample space the collection of all
sequences $\omega := \{x_n\}_{n \geq 1}$, where $x_n = 1$ or $0$, depending on whether the $n$-th toss
results in heads or tails. The subset $A = \{\omega;\ x_k = 1 \text{ for } k = 1 \text{ to } 1{,}000\}$ is a lucky
event for anyone betting on heads!


Probability theory was born out of the study of practical problems (mostly 
gambling, but not exclusively) and this has led the probabilists to develop their 
own dialect which connects their science to reality and favors intuition. 


One says that outcome $\omega$ realizes event $A$ if $\omega \in A$. For instance, in the die
model of Example 1.1.1, the outcome $\omega = 1$ realizes the event “result is odd”, since
$1 \in A = \{1, 3, 5\}$. Obviously, if $\omega$ does not realize $A$, it realizes $\overline{A}$. Event $A \cap B$ is
realized by outcome $\omega$ if and only if $\omega$ realizes both $A$ and $B$. Similarly, $A \cup B$ is
realized by $\omega$ if and only if $\omega$ realizes at least one event among $A$ and $B$ (both can
be realized). Two events $A$ and $B$ are called incompatible when $A \cap B = \varnothing$. In
other words, event $A \cap B$ is impossible: no outcome $\omega$ can realize both $A$ and $B$.
For this reason one refers to the empty set $\varnothing$ as the impossible event. Naturally,
$\Omega$ is called the certain event.


¹ However, in general, the appellation “event” will be reserved to a more restricted class of
subsets. See Definition 1.1.4.


Recall now that the notation $\sum_{k=1}^{\infty} A_k$ is used for $\cup_{k=1}^{\infty} A_k$ only when the subsets
$A_k$ are pairwise disjoint. In the terminology of sets, the sets $A_1, A_2, \ldots$ form a
partition of $\Omega$ if $\sum_{k=1}^{\infty} A_k = \Omega$. One then calls events $A_1, A_2, \ldots$ mutually exclusive
and exhaustive. They are exhaustive in the sense that any outcome $\omega$ realizes at
least one among them. They are mutually exclusive in the sense that any two
distinct events among them are incompatible. Therefore, any $\omega$ realizes one and
only one of the events $A_1, A_2, \ldots$


If $B \subseteq A$, event $B$ is said to imply event $A$, because $\omega$ realizes $A$ whenever it
realizes $B$.


Probability theory associates with each event a number, the probability of the
said event. The collection $\mathcal{F}$ of events to which a probability is assigned is not
always identical to the collection of all subsets of $\Omega$. The requirement on $\mathcal{F}$ is that
it should be a $\sigma$-field, whose definition follows.


Definition 1.1.4 Let $\mathcal{F}$ be a collection of subsets of $\Omega$, such that
(i) the certain event $\Omega$ is in $\mathcal{F}$,
(ii) if $A$ belongs to $\mathcal{F}$, then so does its complement $\overline{A}$, and
(iii) if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their union $\cup_{k=1}^{\infty} A_k$.


One then calls $\mathcal{F}$ a $\sigma$-field on $\Omega$, here the $\sigma$-field of events.


The requirements in the definition are in a sense minimal if you want the $\sigma$-field
$\mathcal{F}$ to contain the “interesting” events (those for which you are eager to compute the
probability). Indeed, the complement, the unions and intersections of interesting
events are most likely interesting events. A natural question at this point is: why
not accept in general as an event the union of an arbitrary (not just countable)
collection of events? The answer is given in the next section.


Note that the impossible event $\varnothing$, being the complement of the certain event
$\Omega$, is in $\mathcal{F}$. Note also that if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their intersection
$\cap_{k=1}^{\infty} A_k$ (see Exercise 1.5.27).


The collection $\mathcal{P}(\Omega)$ of all subsets of $\Omega$ and $\mathcal{F} = \{\Omega, \varnothing\}$ are called respectively
the trivial $\sigma$-field and the gross $\sigma$-field.
If the sample space $\Omega$ is finite or countable, one usually (but not always and
not necessarily) considers any subset of $\Omega$ to be an event. That is, $\mathcal{F} = \mathcal{P}(\Omega)$. But
this is not true in the general case, for technical reasons that will be of little
concern in this course.² Even in the discrete case, this is not necessarily true. For
instance, suppose that you wish to play heads or tails and have no coin (as an
inveterate gambler, you are probably broke), but you keep handy a precious die
in your pocket. You can use the die model of Example 1.1.1, calling “even” heads
and “odd” tails. That is, you will use the $\sigma$-field $\{\Omega, \varnothing, \{1, 3, 5\}, \{2, 4, 6\}\}$ instead
of the trivial $\sigma$-field.


Definition 1.1.5 Let $\Omega$ be an arbitrary set, and let $\mathcal{C}$ be a non-empty collection
of its subsets. The $\sigma$-field generated by $\mathcal{C}$, denoted by $\sigma(\mathcal{C})$, is by definition the
smallest $\sigma$-field containing all the subsets in $\mathcal{C}$.
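
On a finite sample space the generated field can be computed by brute force, since closure under complementation and finite unions is then sufficient. The following Python sketch is only an illustration (the function name `generated_sigma_field` is ours, not the book's). Applied to the die model with the single generating event “result is odd”, it returns the four-set field used in the heads-or-tails game described above.

```python
from itertools import combinations


def generated_sigma_field(omega, collection):
    """Smallest family of subsets of a FINITE omega that contains
    `collection` and is closed under complement and union."""
    omega = frozenset(omega)
    field = {omega, frozenset()} | {frozenset(c) for c in collection}
    while True:
        # One closure step: add all complements and all pairwise unions.
        new_sets = {omega - a for a in field}
        new_sets |= {a | b for a, b in combinations(field, 2)}
        if new_sets <= field:      # nothing new: the family is closed
            return field
        field |= new_sets


die = {1, 2, 3, 4, 5, 6}
sigma = generated_sigma_field(die, [{1, 3, 5}])
for event in sorted(sigma, key=lambda s: (len(s), sorted(s))):
    print(sorted(event))
```

The loop terminates because a finite set has finitely many subsets. This approach does not extend to infinite sample spaces, where countable unions cannot be enumerated.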


Let us now agree to call an interval of $\mathbb{R}$ any convex subset of $\mathbb{R}$: $[a, b]$, $[a, b)$,
$(a, b]$, $(a, b)$, $(-\infty, b]$, $(-\infty, b)$, $(a, +\infty)$, $[a, +\infty)$, $(-\infty, +\infty)$.


Definition 1.1.6 The $\sigma$-field on $\mathbb{R}^n$, denoted by $\mathcal{B}(\mathbb{R}^n)$ and called the Borel
$\sigma$-field on $\mathbb{R}^n$ is, by definition, the smallest $\sigma$-field on $\mathbb{R}^n$ that contains all rectangles,
that is, all sets of the form $\prod_{j=1}^{n} I_j$, where the $I_j$'s are arbitrary intervals of $\mathbb{R}$.


In other words, $\mathcal{B}(\mathbb{R}^n)$ is the $\sigma$-field on $\mathbb{R}^n$ generated by the rectangles.


The above definition of the Borel $\sigma$-field is not constructive and therefore one
may wonder if there exist sets that are not Borel sets. The theory tells us that
there are indeed such sets, but they are in a sense “exotic” and never met in
applications. At this stage, you just have to know that any set for which you have
once computed the $n$-volume is in $\mathcal{B}(\mathbb{R}^n)$.


EXAMPLE 1.1.7: HEADS OR TAILS, TAKE 2. Let $\mathcal{F}$ be the smallest $\sigma$-field that
contains all the sets $\{\omega;\ x_k = 1\}$ ($k \geq 1$). It also contains the sets $\{\omega;\ x_k = 0\}$,
$k \geq 1$ (pass to the complements), and therefore (take intersections) all the sets of
the form $\{\omega;\ x_1 = a_1, \ldots, x_n = a_n\}$ ($n \geq 1$, $a_1, \ldots, a_n \in \{0, 1\}$).


1.2 Probability of Events 


The probability $P(A)$ of an event $A$ measures the likeliness of its occurrence. As
a function defined on $\mathcal{F}$, it is required to satisfy a few properties, the axioms
of probability. These are motivated by the following heuristic interpretation of
$P(A)$ as the empirical frequency of occurrence of event $A$. If $n$ “independent”
experiments are performed, among which $n_A$ result in the realization of $A$, then
the empirical frequency of occurrences of event $A$,

$$F(A) = \frac{n_A}{n},$$

should be close to $P(A)$ if $n$ is “sufficiently large”. (This statement will be clarified
later on by the law of large numbers.) Clearly, the empirical frequency function $F$
satisfies the axioms listed in the following definition.


² See however the comment in the next subsection.


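Definition 1.2.1 A probability on $(\Omega, \mathcal{F})$ is a mapping $P : \mathcal{F} \to \mathbb{R}$ such that
(i) $0 \leq P(A) \leq 1$,
(ii) $P(\Omega) = 1$, and
(iii) $P(\cup_{k=1}^{\infty} A_k) = \sum_{k=1}^{\infty} P(A_k)$ whenever the sets $A_k \in \mathcal{F}$ ($k \geq 1$) are mutually
disjoint.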


Property (iii) is called $\sigma$-additivity. The triple $(\Omega, \mathcal{F}, P)$ is called a probability
space, or an abstract probability model.


EXAMPLE 1.2.2: TOSSING A DIE, TAKE 2. An event $A$ is a subset of $\Omega =
\{1, 2, 3, 4, 5, 6\}$. The formula

$$P(A) = \frac{|A|}{6},$$

where $|A|$ is the cardinality of $A$ (the number of elements in $A$), defines a probability
$P$.
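
Since the die sample space has only 64 subsets, the axioms of Definition 1.2.1 can be checked exhaustively for this probability. The Python sketch below is an illustration under our own naming (`OMEGA`, `prob`, `events` are not from the book). On a finite space, σ-additivity reduces to additivity for two disjoint events, which is what the code tests.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))


def prob(event):
    """P(A) = |A| / 6 for an event A contained in OMEGA."""
    return Fraction(len(event), len(OMEGA))


# All 2^6 = 64 events, enumerated by cardinality.
events = [frozenset(c)
          for size in range(len(OMEGA) + 1)
          for c in combinations(sorted(OMEGA), size)]

bounded = all(0 <= prob(a) <= 1 for a in events)          # axiom (i)
normalized = prob(OMEGA) == 1                             # axiom (ii)
additive = all(prob(a | b) == prob(a) + prob(b)           # axiom (iii)
               for a in events for b in events if not (a & b))

print(bounded, normalized, additive)
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are tested without rounding error.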


EXAMPLE 1.2.3: HEADS OR TAILS, TAKE 3. Choose probability $P$ such that for
any event of the form $A = \{x_1 = a_1, \ldots, x_n = a_n\}$, where $a_1, \ldots, a_n$ are arbitrary
in $\{0, 1\}$,

$$P(A) = \frac{1}{2^n}.$$

Note that this does not define the probability of all events of $\mathcal{F}$. But the theory
tells us that there exists such a probability satisfying the above requirement and
that it is unique.³


³ In this book, all the results concerning the existence and uniqueness of probabilities will be
assumed, as they require a deeper immersion in the theory and are in fact easily admitted. The
interested reader will find the proofs in [1] or [3] for instance.
EXAMPLE 1.2.4: RANDOM POINT IN THE SQUARE. The following is a possible
model of a random point in the unit square: $\Omega = [0, 1]^2$, $\mathcal{F}$ is the collection of
sets in the Borel $\sigma$-field $\mathcal{B}(\mathbb{R}^2)$ that are contained in $[0, 1]^2$. The theory tells us
that there indeed exists one and only one probability $P$ assigning to rectangles
therein their area in the usual sense, called the Lebesgue measure on $[0, 1]^2$, which
formalizes the intuitive notion of area.


The probability of Example 1.2.2 suggests an unbiased die, where the outcomes 
1, 2, 3, 4, 5 and 6 are equiprobable. As we shall see later on, the probability P of 
Example 1.2.3 implies an unbiased coin and independent tosses (the emphasized 
terms will be defined later). 


We now answer a question that the reader may have in mind. Why, for instance
in Example 1.2.4 above, don’t we take for the $\sigma$-field of events the trivial $\sigma$-field?
The answer is easy, although its proof is not immediate and belongs to an advanced
course on measure theory: there exists no probability $P$ on the trivial $\sigma$-field on
the square $[0, 1]^2$ that assigns to rectangles therein their area in the usual sense.


Another question is: why impose only $\sigma$-additivity, and not unrestricted
additivity ($P(\cup_{i \in I} A_i) = \sum_{i \in I} P(A_i)$ where the index set $I$ is arbitrary and the $A_i$’s are
mutually disjoint)? In fact, if unrestricted additivity were part of the definition
of probability, there would exist no such “probability” on the Borel $\sigma$-field on
the square $[0, 1]^2$ assigning to rectangles therein their area in the usual sense (see
Exercise 1.5.5).


We now list some properties that follow directly from the axioms:

Theorem 1.2.5 For any event $A$

$$P(\overline{A}) = 1 - P(A), \tag{1.1}$$

and

$$P(\varnothing) = 0. \tag{1.2}$$


Proof. For a proof of (1.1), use additivity:

$$1 = P(\Omega) = P(A + \overline{A}) = P(A) + P(\overline{A}).$$

Applying (1.1) with $A = \Omega$ gives (1.2).


Theorem 1.2.6 Monotonicity:

$$A \subseteq B \implies P(A) \leq P(B). \tag{1.3}$$

Proof. Observe that $B = A + (B - A)$ when $A \subseteq B$, and therefore

$$P(B) = P(A) + P(B - A) \geq P(A).$$


Theorem 1.2.7 Sub-$\sigma$-additivity:

$$P(\cup_{k=1}^{\infty} A_k) \leq \sum_{k=1}^{\infty} P(A_k). \tag{1.4}$$

See Exercise 1.5.7.


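The next property, the sequential continuity of probability, is close to a
tautology and yet extremely useful.


Theorem 1.2.8 Let $\{A_n\}_{n \geq 1}$ be a non-decreasing sequence of events, that is,
$A_{n+1} \supseteq A_n$ ($n \geq 1$). Then

$$P(\cup_{n=1}^{\infty} A_n) = \lim_{n \uparrow \infty} P(A_n). \tag{1.5}$$

Proof. Write

$$A_n = A_1 + (A_2 - A_1) + \cdots + (A_n - A_{n-1})$$

and

$$\cup_{k=1}^{\infty} A_k = A_1 + (A_2 - A_1) + (A_3 - A_2) + \cdots$$

Therefore,

$$\begin{aligned} P(\cup_{k=1}^{\infty} A_k) &= P(A_1) + \sum_{j=2}^{\infty} P(A_j - A_{j-1}) \\ &= \lim_{n \uparrow \infty} \Big\{ P(A_1) + \sum_{j=2}^{n} P(A_j - A_{j-1}) \Big\} = \lim_{n \uparrow \infty} P(A_n). \end{aligned}$$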

Corollary 1.2.9 Let $\{B_n\}_{n \geq 1}$ be a non-increasing sequence of events, that is,
$B_{n+1} \subseteq B_n$ ($n \geq 1$). Then,

$$P(\cap_{n=1}^{\infty} B_n) = \lim_{n \uparrow \infty} P(B_n). \tag{1.6}$$

See Exercise 1.5.8.


A central notion of probability is that of a negligible set. Its importance is due
to the fact that probabilistic calculations bear on the probability of events, not on
the events themselves. One will never be able to say that an event such as “the
empirical frequency of heads in an infinite sequence of independent tosses of a
fair coin is equal to $\frac{1}{2}$” is certain, that is, is identical to $\Omega$. One will only be able
to prove that the complementary event has null probability.


Definition 1.2.10 A set $N \subset \Omega$ is called $P$-negligible if it is contained in an
event $A \in \mathcal{F}$ of null probability.


Note that the set $N$ need not be an event (an element of $\mathcal{F}$). An event that is
a negligible set will of course be called a negligible event.


Theorem 1.2.11 A countable union of negligible sets is a negligible set.


Proof. Let $N_k$ ($k \geq 1$) be $P$-negligible sets. By definition there exists a sequence
$A_k$ ($k \geq 1$) of events of null probability such that $N_k \subseteq A_k$ ($k \geq 1$). We have

$$N := \cup_{k \geq 1} N_k \subseteq A := \cup_{k \geq 1} A_k,$$

and then $P(A) = 0$, by the sub-$\sigma$-additivity property of probability.


EXAMPLE 1.2.12: RANDOM POINT IN THE SQUARE, TAKE 2. Each rational 
point of the square considered as a set (a singleton) has a null area and therefore 
null probability. Therefore, in this model, the (countable) set of rational points 
of the square has null probability. In other words, in this particular model, the 
probability of drawing a rational point is null. 


1.3 Independence and Conditioning


Recall the heuristic frequency interpretation of probability at the beginning of
Section 1.2. A situation where

$$\frac{n_{A \cap B}}{n_B} \approx \frac{n_A}{n}$$

(here $\approx$ is a non-mathematical symbol meaning “approximately equal”) suggests
some kind of “independence” of $A$ and $B$, in the sense that statistics relative to $A$
do not vary when passing from a neutral sample of the population to a selected sample
characterized by the property $B$. For example, the proportion of people with a
family name beginning with H is the same among a large population with the
usual mix of men and women as it would be among a large all-male population.
This prompts us to give the following formal definition of independence, the single
most important concept of probability theory.
Definition 1.3.1 Two events $A$ and $B$ are called independent if

$$P(A \cap B) = P(A)P(B). \tag{1.7}$$

One should be aware that incompatibility does not mean independence. As a
matter of fact, two incompatible events $A$ and $B$ are independent if and only if
at least one of them has null probability. Indeed, if $A$ and $B$ are incompatible,
$P(A \cap B) = P(\varnothing) = 0$, and therefore (1.7) holds if and only if $P(A)P(B) = 0$.
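
The distinction between independence and incompatibility can be checked numerically on the die model of Example 1.2.2. The following Python sketch is illustrative only; the events `even`, `odd` and `at_most_four` and the helper `independent` are our own choices, not taken from the book.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))


def prob(event):
    """Uniform probability on the die: P(A) = |A| / 6."""
    return Fraction(len(event), len(OMEGA))


def independent(a, b):
    """Product rule (1.7): P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


even = frozenset({2, 4, 6})
odd = frozenset({1, 3, 5})
at_most_four = frozenset({1, 2, 3, 4})

print(independent(even, at_most_four))  # P = 1/3 on both sides
print(independent(even, odd))           # incompatible, both of positive probability
print(independent(even, frozenset()))   # incompatible, one of null probability
```

Here `even` and `at_most_four` are independent although they overlap, whereas `even` and `odd` are incompatible and fail the product rule because neither has null probability.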


The notion of independence carries over straightforwardly to families of events. 


Definition 1.3.2 A sequence $\{A_n\}_{n \in \mathbb{N}}$ of events is called independent if for any
finite set of indices $i_1, \ldots, i_r \in \mathbb{N}$,

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_r}) = P(A_{i_1}) \times P(A_{i_2}) \times \cdots \times P(A_{i_r}).$$

One also says that the $A_n$’s ($n \in \mathbb{N}$) are jointly independent.


Definition 1.3.3 The conditional probability of $A$ given $B$ is the number

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{1.8}$$

defined when $P(B) > 0$. If $P(B) = 0$, one defines $P(A \mid B)$ arbitrarily between 0
and 1.


In particular, if A and B are independent, then P(A | B) = P(A). 


The quantity $P(A \mid B)$ represents our expectation of $A$ being realized when
the only available information is that $B$ is realized. The corresponding heuristic
quantity is the relative frequency $n_{A \cap B}/n_B$.


Probability theory is primarily concerned with the computation of probabilities 
of complex events. The following formulas, the so-called Bayes rules, are not only 
useful, but indispensable. They lay the foundations of the elementary calculus of 
probability and give the first opportunity to solve simple yet non-trivial problems. 


Theorem 1.3.4 With $P(A) > 0$, we have the Bayes rule of retrodiction:

$$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}. \tag{1.9}$$

Proof. Rewrite (1.8) symmetrically in $A$ and $B$:

$$P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A).$$
Theorem 1.3.5 Let $B_1, B_2, \ldots$ be events forming a partition of $\Omega$, that is such
that $\sum_{i=1}^{\infty} B_i = \Omega$. Then for any event $A$, we have the Bayes rule of total causes:

$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)P(B_i). \tag{1.10}$$

Proof. Decompose $A$ as follows:

$$A = A \cap \Omega = A \cap \Big( \sum_{i=1}^{\infty} B_i \Big) = \sum_{i=1}^{\infty} (A \cap B_i).$$

Therefore (by $\sigma$-additivity and by definition of conditional probability):

$$P(A) = P\Big( \sum_{i=1}^{\infty} (A \cap B_i) \Big) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A \mid B_i)P(B_i).$$

EXAMPLE 1.3.6: DIPLOIDS AND THE HARDY–WEINBERG LAW. In diploid
organisms (you for instance) each hereditary character is carried by a pair of genes.
Consider the situation in which a given gene can take two forms called alleles,
denoted a and A. Such was the case in the historical experiments performed in
1865 by the Czech monk Gregor Mendel, who studied the hereditary transmission
of the nature of the skin in a species of green pea. The two alleles corresponding
to the gene or character “nature of the skin” are a for “wrinkled” and A for
“smooth”. The genes are grouped into pairs and there are two alleles, thus three
genotypes are possible for the character under study: aa, Aa (same as aA), and
AA.⁴ During the reproduction process, each of the two parents contributes to
the genetic heritage of their descendant by providing one allele of their pair. This
is done by intermediaries of the reproductive cells called gametes (in the human
species, the spermatozoid and the ovum) which carry only one gene of the pair of
genes characteristic of each parent. The gene carried by the gamete is chosen at
random among the pair of genes of the parent. The actual process occurring in
the reproduction of diploid cells is called meiosis.


⁴ With each genotype is associated a phenotype, which is the external appearance
corresponding to the genotype. Genotypes aa and AA have different phenotypes (otherwise no
character could be isolated), and the phenotype of Aa lies somewhere between the phenotypes of
aa and AA. Sometimes an allele is dominant, say A, and the phenotype of Aa is then the same as
the phenotype of AA.




A given cell possesses two chromosomes. A chromosome can be viewed as 
a string of genes, each gene being at a specific location in the chain. A given 
chromosome duplicates itself and four new cells are formed for every chromosome 
(see the figure below). One of the four gametes of a “mate” (say, the ovum) chosen
at random selects randomly one of the four gametes of the other “partner” (say,
the spermatozoid) and this gives “birth” to a pair of alleles.


Figure 1.1: Meiosis (one parent cell, four gametes).


Let us start from an idealized infinite population where the genotypes are
found in the following proportions:

AA : Aa : aa

$x$ : $2z$ : $y$.

Here $x$, $y$, and $z$ are numbers between 0 and 1, and $x + 2z + y = 1$. The two parents
are chosen independently (random mating), and their gamete chooses an allele at
random in the pair carried by the corresponding parent.


We seek the genotype distribution of the second generation. Our first task
consists in providing a probabilistic model. We propose the following one. The
sample space $\Omega$ is the collection of all quadruples $\omega = (x_1, x_2, y_1, y_2)$ where $x_1$ and
$x_2$ take their values in $\{AA, Aa, aa\}$, and $y_1$ and $y_2$ take their values in $\{A, a\}$.
Four “coordinate functions” $X_1, X_2, Y_1, Y_2$ are defined by $X_1(\omega) = x_1$, $X_2(\omega) =
x_2$, $Y_1(\omega) = y_1$, and $Y_2(\omega) = y_2$. We interpret $X_1$ and $X_2$ as the pairs of genes in
parents 1 and 2 respectively. $Y_1$ is the allele chosen by gamete 1 among the alleles
of $X_1$, with a similar definition for $Y_2$. The data available are, for the selection of
parents:

$$\begin{aligned} P(X_1 = AA) &= P(X_2 = AA) = x, \\ P(X_1 = aa) &= P(X_2 = aa) = y, \\ P(X_1 = Aa) &= P(X_2 = Aa) = 2z, \end{aligned}$$

and for the choice of allele by gamete 1:

$$\begin{aligned} P(Y_1 = A \mid X_1 = AA) &= 1, & P(Y_1 = a \mid X_1 = AA) &= 0, \\ P(Y_1 = A \mid X_1 = aa) &= 0, & P(Y_1 = a \mid X_1 = aa) &= 1, \\ P(Y_1 = A \mid X_1 = Aa) &= \tfrac{1}{2}, & P(Y_1 = a \mid X_1 = Aa) &= \tfrac{1}{2}, \end{aligned}$$

and the similar data for the choice of allele by gamete 2. One must also add the
assumptions of independence of $X_1$ and $X_2$ and of $Y_1$ and $Y_2$. We are required to
compute the genotype distribution of the second generation, that is,

$$\begin{aligned} p &= P(Y_1 = A, Y_2 = A), \\ q &= P(Y_1 = a, Y_2 = a), \\ 2r &= P(Y_1 = A, Y_2 = a \text{ or } Y_1 = a, Y_2 = A). \end{aligned}$$

We start with the computation of $p$. In view of the independence of $Y_1$ and $Y_2$,
$p = P(Y_1 = A)P(Y_2 = A)$. By the rule of total causes,

$$\begin{aligned} P(Y_1 = A) &= P(Y_1 = A \mid X_1 = AA)P(X_1 = AA) \\ &\quad + P(Y_1 = A \mid X_1 = Aa)P(X_1 = Aa) \\ &\quad + P(Y_1 = A \mid X_1 = aa)P(X_1 = aa) \\ &= 1 \cdot x + \tfrac{1}{2} \cdot 2z + 0 \cdot y = x + z. \end{aligned}$$

Therefore $p = (x + z)^2$ and by symmetry, $q = (y + z)^2$. Now $2r = P(Y_1 = A, Y_2 =
a) + P(Y_1 = a, Y_2 = A)$, and therefore by symmetry, $r = P(Y_1 = A, Y_2 = a)$. In
view of the independence of $Y_1$ and $Y_2$, $r = P(Y_1 = A)P(Y_2 = a)$. Finally, in view
of previous computations, $2r = 2(x + z)(y + z)$.
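
The computation above can be replayed with exact fractions. The Python sketch below is an illustration; the function name `next_generation` and the numerical proportions 1/2, 1/4, 1/4 are assumptions chosen for the example, not data from the book. It applies the rule of total causes to obtain the probability that a gamete carries A, then forms the second-generation proportions (AA, Aa, aa).

```python
from fractions import Fraction


def next_generation(x, z, y):
    """Given genotype proportions AA: x, Aa: 2z, aa: y (summing to 1),
    return the next-generation proportions (p, 2r, q) for (AA, Aa, aa)."""
    if x + 2 * z + y != 1:
        raise ValueError("proportions x + 2z + y must sum to 1")
    # Rule of total causes: P(Y = A) = 1*x + (1/2)*(2z) + 0*y = x + z.
    p_allele_A = 1 * x + Fraction(1, 2) * (2 * z) + 0 * y
    p_allele_a = 1 - p_allele_A          # equals y + z
    p = p_allele_A * p_allele_A          # both gametes carry A
    q = p_allele_a * p_allele_a          # both gametes carry a
    two_r = 2 * p_allele_A * p_allele_a  # one of each, in either order
    return p, two_r, q


# Hypothetical starting proportions: AA 1/2, Aa 1/4 (so z = 1/8), aa 1/4.
first = next_generation(Fraction(1, 2), Fraction(1, 8), Fraction(1, 4))
second = next_generation(first[0], first[1] / 2, first[2])
print(first, second)
```

In this numerical instance the proportions obtained after one generation are left unchanged by a further generation, since the allele frequency $x + z$ is preserved by the map.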


Theorem 1.3.7 For any sequence of events $A_1, \ldots, A_k$, we have the Bayes
sequential formula:

$$P(\cap_{i=1}^{k} A_i) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid \cap_{i=1}^{k-1} A_i). \tag{1.11}$$

Proof. By induction. First observe that (1.11) is true for $k = 2$ by definition of
conditional probability. Suppose that (1.11) is true for $k$. Write

$$\begin{aligned} P(\cap_{i=1}^{k+1} A_i) &= P\big( (\cap_{i=1}^{k} A_i) \cap A_{k+1} \big) \\ &= P(A_{k+1} \mid \cap_{i=1}^{k} A_i) \, P(\cap_{i=1}^{k} A_i), \end{aligned}$$

and replace $P(\cap_{i=1}^{k} A_i)$ by the assumed equality (1.11) to obtain the same equality
with $k + 1$ replacing $k$.


EXAMPLE 1.3.8: SHOULD ONE ALWAYS BELIEVE DOCTORS? Doctors apply a 
test that gives a positive result in 99% of the cases where the patient is affected 
by the disease. However it happens in 2% of the cases that a healthy patient has a 
positive test. Statistical data show that one individual out of 1000 has the disease. 
We compute the probability for a patient with a positive test to be affected by the 
disease. 


Let $M$ be the event “patient is ill,” and let $+$ and $-$ be the events “test is
positive” and “test is negative” respectively. We have the data

$$P(M) = 0.001, \quad P(+ \mid M) = 0.99, \quad P(+ \mid \overline{M}) = 0.02,$$

and we must compute $P(M \mid +)$. By the retrodiction formula,

$$P(M \mid +) = \frac{P(+ \mid M)P(M)}{P(+)}.$$

By the formula of total causes,

$$P(+) = P(+ \mid M)P(M) + P(+ \mid \overline{M})P(\overline{M}).$$

Therefore,

$$P(M \mid +) = \frac{(0.99)(0.001)}{(0.99)(0.001) + (0.02)(0.999)},$$

that is, approximately 0.047.
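
The arithmetic of this example is easy to get wrong by hand, so it is worth evaluating directly. The Python sketch below combines the formula of total causes with the retrodiction formula; the function name `posterior` and its parameter names are our own, used only for illustration.

```python
from fractions import Fraction


def posterior(prior, sensitivity, false_positive_rate):
    """P(M | +) from P(M), P(+ | M) and P(+ | not M)."""
    # Formula of total causes for P(+).
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Retrodiction formula (1.9).
    return sensitivity * prior / p_positive


p_ill_given_positive = posterior(
    prior=Fraction(1, 1000),
    sensitivity=Fraction(99, 100),
    false_positive_rate=Fraction(2, 100),
)
print(p_ill_given_positive, float(p_ill_given_positive))
```

With the data of the example the exact value is 11/233, a little under 5%: fewer than one in twenty patients with a positive test is actually ill.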


One might have mixed feelings concerning the reliability of the pharmaceutical 
company that manufactures such a test. There is however a possible explanation. 
For certain types of illness, it is much preferable to provoke many false alarms 
than to fail to detect the illness. This is the case in group testing, where the 
blood samples are mixed. If this mixed sample is positive, the patients are tested 
individually with a more reliable, in general more expensive, test. 
EXAMPLE 1.3.9: THE BALLOT PROBLEM. In an election, candidates I and II
have obtained $a$ and $b$ votes respectively. Candidate I won, that is $a > b$. We shall
compute the probability that in the course of the vote counting process, candidate
I has always had the lead.


Let $p_{a,b}$ be the probability that I is always ahead. By the Bayes rule of total
causes, and conditioning on the last vote:

$$\begin{aligned} p_{a,b} &= P(\text{I always ahead} \mid \text{I gets last vote}) \, P(\text{I gets last vote}) \\ &\quad + P(\text{I always ahead} \mid \text{II gets last vote}) \, P(\text{II gets last vote}) \\ &= p_{a-1,b} \, \frac{a}{a+b} + p_{a,b-1} \, \frac{b}{a+b}, \end{aligned}$$

with the convention that for $a = b + 1$, $p_{a-1,b} = p_{b,b} = 0$. The result follows by
induction on the total number of votes $a + b$:

$$p_{a,b} = \frac{a - b}{a + b}.$$
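
The recursion obtained by conditioning on the last vote can be evaluated mechanically and compared with the closed form. The Python sketch below is an illustration (the function name `always_ahead` is ours); it additionally assumes the boundary value 1 when the losing candidate receives no vote, which the induction needs as its starting point.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def always_ahead(a, b):
    """Probability that the candidate with `a` votes is strictly ahead
    throughout the count, the other candidate having `b` votes."""
    if a <= b:
        return Fraction(0)   # convention: cannot be ahead at the end
    if b == 0:
        return Fraction(1)   # assumed boundary case: no opposing vote
    total = a + b
    # Condition on which candidate receives the last counted vote.
    return (Fraction(a, total) * always_ahead(a - 1, b)
            + Fraction(b, total) * always_ahead(a, b - 1))


matches_closed_form = all(
    always_ahead(a, b) == Fraction(a - b, a + b)
    for a in range(1, 13)
    for b in range(0, a)
)
print(always_ahead(3, 2), matches_closed_form)
```

Memoization (`lru_cache`) keeps the recursion polynomial in the number of votes. The check covers only small values of $a$ and $b$ and is no substitute for the induction.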


Definition 1.3.10 Let $A$, $B$, and $C$ be events, with $P(C) > 0$. One says that $A$
and $B$ are conditionally independent given $C$ if

$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C). \tag{1.12}$$

In other words, $A$ and $B$ are independent with respect to the probability $P_C$
defined by $P_C(A) = P(A \mid C)$ (see Exercise 1.5.18).


EXAMPLE 1.3.11: CHEAP WATCHES. Two factories A and B manufacture 
watches. Factory A produces on average one defective item out of 100, and B 
produces on average one bad watch out of 200. A retailer receives a container of 
watches from one of the two above factories, but he does not know which. He 
checks the first watch. It works! 


(a) What is the probability that the second watch he will check is good? 


(b) Are the states of the first two watches independent? 


Solution: (a) Let $X_n$ be the state of the $n$th watch in the container, with $X_n = 1$
if it works and $X_n = 0$ if it does not. Let $Y$ be the factory of origin. We express
our a priori ignorance of where the container comes from by

$$P(Y = A) = P(Y = B) = \frac{1}{2}.$$

Also, we assume that given $Y = A$ (resp., $Y = B$), the states of the successive
watches are independent. For instance,

$$P(X_1 = 1, X_2 = 0 \mid Y = A) = P(X_1 = 1 \mid Y = A)P(X_2 = 0 \mid Y = A).$$

We have the data

$$P(X_n = 0 \mid Y = A) = 0.01, \quad P(X_n = 0 \mid Y = B) = 0.005.$$

We are required to compute

$$P(X_2 = 1 \mid X_1 = 1) = \frac{P(X_1 = 1, X_2 = 1)}{P(X_1 = 1)}.$$

By the formula of total causes, the numerator of this fraction equals

$$P(X_1 = 1, X_2 = 1 \mid Y = A)P(Y = A) + P(X_1 = 1, X_2 = 1 \mid Y = B)P(Y = B),$$

that is, $(0.5)(0.99)^2 + (0.5)(0.995)^2$, and the denominator is

$$P(X_1 = 1 \mid Y = A)P(Y = A) + P(X_1 = 1 \mid Y = B)P(Y = B),$$

that is, $(0.5)(0.99) + (0.5)(0.995)$. Therefore,

$$P(X_2 = 1 \mid X_1 = 1) = \frac{(0.99)^2 + (0.995)^2}{0.99 + 0.995}.$$

(b) The states of the two watches are not independent. Indeed, if they were,
then

$$P(X_2 = 1 \mid X_1 = 1) = P(X_2 = 1) = (0.5)(0.99 + 0.995),$$

a result different from what we obtained.
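
The two quantities compared in part (b) differ only in the sixth decimal place, so an exact computation makes the conclusion easier to see. The Python sketch below is an illustration; the dictionary `good_rate` and the variable names are our own.

```python
from fractions import Fraction

half = Fraction(1, 2)                     # P(Y = A) = P(Y = B) = 1/2
good_rate = {"A": Fraction(99, 100),      # P(X_n = 1 | Y = A)
             "B": Fraction(995, 1000)}    # P(X_n = 1 | Y = B)

# Formula of total causes, using conditional independence given the factory.
p_first_good = sum(half * g for g in good_rate.values())
p_both_good = sum(half * g * g for g in good_rate.values())

p_second_given_first = p_both_good / p_first_good

print(float(p_first_good))            # P(X_2 = 1) = P(X_1 = 1)
print(float(p_second_given_first))    # P(X_2 = 1 | X_1 = 1)
print(p_second_given_first > p_first_good)
```

Observing a working first watch slightly raises the probability that the second one works, because it makes the better factory a little more likely as the origin.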


The example above shows that two events A and B can be conditionally independent
given C and conditionally independent given $\overline{C}$, and yet not be independent.


1.4 Counting Models 


A number of problems in Probability reduce to counting the elements of finite sets.
The general setting is the following.

The set of all possible outcomes, $\Omega$, is finite, and for some reason (symmetry for
instance) one is led to believe that all the outcomes $\omega$ have the same probability.
Since the probabilities sum up to one, each outcome has probability $1/|\Omega|$, where $|\Omega|$
denotes the cardinality (the number of elements) of the set $\Omega$. Since the probability
of an event A is the sum of the probabilities of all outcomes $\omega \in A$, we have

$$P(A) = \frac{|A|}{|\Omega|}. \tag{1.13}$$

Thus, computing P(A) requires counting the elements in the sets A and $\Omega$.


There is a whole branch of mathematics devoted mainly to counting, called 
combinatorics. A basic item of combinatorics is the binomial coefficient. The bi- 
nomial coefficient expresses the number of fixed-size subsets of a finite set. Suppose 
we have a set $F$ containing $n$ elements denoted by $1, 2, \ldots, n$. How many different
subsets of $p$ elements of $F$ are there? If we denote this number, called the binomial
coefficient, by $\binom{n}{p}$, then we have

$$\binom{n}{p} = \frac{n!}{p!\,(n-p)!}. \tag{1.14}$$


Proof. To prove this formula, we proceed in two steps. First we shall determine
the number of possible ordered sequences of $p$ elements taken from $F$ without
repetition. (Note the difference between an ordered sequence of $p$ elements without
repetition and a subset of $p$ elements: a subset such as $\{1,2,3\}$ for instance gives
rise to 6 ordered sequences of 3 elements taken from $F$ without repetition:

$$(1,2,3),\ (2,3,1),\ (3,1,2),\ (3,2,1),\ (1,3,2),\ (2,1,3).)$$

To make an ordered sequence of $p$ elements taken from $F$ without repetition, we
must first select the first element: there are $n$ choices. Having selected the first
element, there remain only $n-1$ choices for the second element since we exclude
repetitions. We proceed in this way up to the last element, which must be chosen
among the $n-p+1$ remaining elements. Thus the number $A(n,p)$ of ordered
sequences of $p$ elements taken from $F$ without repetition is
$n(n-1)(n-2)\cdots(n-p+1)$, that is

$$A(n,p) = \frac{n!}{(n-p)!}.$$

In particular, the number of ordered sequences of length $p$ that one can obtain
from a given set of $p$ elements is $A(p,p) = p!$, and since each subset of $p$ elements
gives rise to exactly $p!$ ordered sequences of length $p$, we have

$$\binom{n}{p}\,p! = A(n,p) = \frac{n!}{(n-p)!},$$

which is formula (1.14).


Let now $F$ be a finite set with $n$ elements. How many subsets of $F$ are there?
One could answer with $\sum_{p=0}^{n}\binom{n}{p}$, and this is true if we use the convention that
$\binom{n}{0} = 1$ or equivalently $0! = 1$. (Recall that the empty set $\emptyset$ is a subset of $F$, and
it is the only subset of $F$ with 0 elements. With the above conventions, formula
(1.14) also holds for $p = 0$.) Therefore, anticipating the binomial formula (1.15),
this number is

$$2^n = \sum_{p=0}^{n}\binom{n}{p}. \tag{$\ast$}$$


However one can prove directly that the number of subsets of $F$ is $2^n$.


Proof. Let $x_1, x_2, \ldots, x_n$ be an enumeration of the elements of $F$. To any subset
of $F$ there corresponds a sequence of 0’s and 1’s of length $n$, where there is a 1
in the $i$-th position if and only if $x_i$ is included in the subset. Conversely, to any
sequence of 0’s and 1’s of length $n$, there corresponds a subset of $F$ consisting of all
$x_i$’s for which the $i$-th digit of the sequence is 1. Therefore, the number of subsets
of $F$ is equal to the number of sequences of length $n$ of 0’s and 1’s, which is $2^n$.


(This method of proof, consisting in establishing a bijection with a set which 
is easy to count, is fundamental in combinatorics.) 
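
The bijection between subsets and 0/1 sequences can be illustrated on a small set. The sketch below is only an illustration: the five-element set `F` and the helper name `subsets_via_bits` are arbitrary choices. Each integer `mask` between 0 and 2^n - 1 plays the role of a sequence of 0’s and 1’s of length n.

```python
from math import comb, factorial


def subsets_via_bits(elements):
    """List all subsets: bit i of `mask` tells whether elements[i] is included."""
    n = len(elements)
    return [
        {elements[i] for i in range(n) if (mask >> i) & 1}
        for mask in range(2 ** n)
    ]


F = ["a", "b", "c", "d", "e"]
all_subsets = subsets_via_bits(F)

# Number of subsets of each size p = 0, ..., n, to be compared with the binomial coefficients.
count_by_size = [
    sum(1 for s in all_subsets if len(s) == p) for p in range(len(F) + 1)
]
print(len(all_subsets), count_by_size)
```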


Formula ($\ast$) is a particular case of the binomial formula

$$(x+y)^n = \sum_{p=0}^{n}\binom{n}{p}\,x^p y^{n-p}. \tag{1.15}$$

Letting $x = y = 1$ indeed gives ($\ast$).


Proof. (of the binomial formula) Let $x_i, y_i$ ($1 \le i \le n$) be real numbers. The
product $\prod_{i=1}^{n}(x_i + y_i)$ is the sum of all possible products $x_{i_1}\cdots x_{i_p}\,y_{j_1}\cdots y_{j_{n-p}}$,
where $\{i_1,\ldots,i_p\}$ is a subset of $\{1,\ldots,n\}$ and $\{j_1,\ldots,j_{n-p}\}$ is the complement of
$\{i_1,\ldots,i_p\}$ in $\{1,\ldots,n\}$. Therefore,

$$\prod_{i=1}^{n}(x_i+y_i) = \sum_{p=0}^{n}\ \sum_{\{i_1,\ldots,i_p\}\subseteq\{1,\ldots,n\}} x_{i_1}\cdots x_{i_p}\,y_{j_1}\cdots y_{j_{n-p}}.$$

The second sum in the right-hand side of this equality contains $\binom{n}{p}$ terms, since
there are $\binom{n}{p}$ different subsets $\{i_1,\ldots,i_p\}$ of $p$ elements of $\{1,\ldots,n\}$. Now, letting
$x_i = x$, $y_i = y$ ($1 \le i \le n$), we obtain the binomial formula.


From (1.14) it follows immediately that

$$\binom{n}{p} = \binom{n}{n-p}. \tag{1.16}$$


EXAMPLE 1.4.1: AN URN PROBLEM. From an urn containing $N_1$ black balls
and $N_2$ red balls, you draw at random, successively and without replacement, $n$
($n \le N_1 + N_2$) balls. The probability of having drawn $k$ black balls ($0 \le k \le \inf(N_1, n)$) is:

$$p_k = \frac{\binom{N_1}{k}\binom{N_2}{n-k}}{\binom{N_1+N_2}{n}}. \tag{1.17}$$


Proof. The set of outcomes $\Omega$ is the family of all subsets $\omega$ of $n$ balls among the
$N_1 + N_2$ balls in the urn. Therefore,

$$|\Omega| = \binom{N_1+N_2}{n}.$$

It is reasonable to suppose that all the outcomes are equiprobable (the urn should
be properly shaken before catching the balls with a net). Therefore, formula (1.13)
applies and one must count the subsets $\omega$ with $k$ black balls and $n-k$ red balls.
To form such a set, you first form a set of $k$ black balls among the $N_1$ black balls,
and there are $\binom{N_1}{k}$ possibilities. To each such subset of $k$ black balls, you must
associate a subset of $n-k$ red balls. This multiplies the possibilities by $\binom{N_2}{n-k}$.
Thus, if $A$ is the number of subsets of $n$ balls among the $N_1 + N_2$ balls in the urn
which consist of $k$ black balls and $n-k$ red balls, then

$$A = \binom{N_1}{k}\binom{N_2}{n-k}.$$
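
The counting argument of the urn problem can be checked by brute force on a small urn. This is a sketch under illustrative assumptions: the function names are hypothetical and the urn sizes (4 black, 3 red, 3 draws) are arbitrary.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def urn_formula(n1, n2, n, k):
    """Probability of k black balls among n drawn, by the counting formula."""
    return Fraction(comb(n1, k) * comb(n2, n - k), comb(n1 + n2, n))


def urn_by_enumeration(n1, n2, n, k):
    """Same probability, by listing all equiprobable subsets of n balls."""
    colors = ["black"] * n1 + ["red"] * n2
    outcomes = list(combinations(range(n1 + n2), n))
    favorable = sum(
        1 for w in outcomes if sum(colors[i] == "black" for i in w) == k
    )
    return Fraction(favorable, len(outcomes))


for k in range(4):
    print(k, urn_formula(4, 3, 3, k), urn_by_enumeration(4, 3, 3, k))
```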


For future reference, we quote here the negative binomial formula:

$$\frac{1}{(1-z)^p} = 1 + \binom{p}{p-1}z + \binom{p+1}{p-1}z^2 + \binom{p+2}{p-1}z^3 + \cdots, \tag{1.18}$$

where $z \in \mathbb{C}$, $|z| < 1$. (Hint for the proof: For $p \ge 2$, $(p-1)!\,(1-z)^{-p}$ is the $(p-1)$-th
derivative of $(1-z)^{-1}$.)


Poincaré’s Formula


Elementary computations give

$$P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$$

and

$$P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1 \cap A_2) - P(A_1 \cap A_3) - P(A_2 \cap A_3) + P(A_1 \cap A_2 \cap A_3).$$


More generally (Exercise 1.5.9): 


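Theorem 1.4.2 Let P be a probability on some measurable space $(\Omega, \mathcal{F})$ and let
$A_1,\ldots,A_n$ be arbitrary events. Then

$$P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n). \tag{1.19}$$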


EXAMPLE 1.4.3: EULER’S FORMULA. Let $\varphi(n)$ denote the number of integers $k$
($1 \le k \le n$) that are prime with the integer $n \ge 2$ (the function $\varphi$ is called Euler’s
function). Euler proved that

$$\frac{\varphi(n)}{n} = \prod_{p \mid n}\Bigl(1 - \frac{1}{p}\Bigr),$$


where the product is over all the prime numbers p that divide n. The proof below 
uses Poincaré’s formula. 


The integer $n \ge 2$ has a (unique) representation as

$$n = p_1^{\alpha_1} \cdots p_r^{\alpha_r},$$

where $p_1,\ldots,p_r$ are distinct prime numbers $> 1$. Let $\Omega := \{1,2,\ldots,n\}$. Take for
probability the uniform probability on $\Omega$, that is

$$P(A) = \frac{|A|}{n},$$

where $|A|$ denotes the cardinality of $A$. Poincaré’s formula then reads:

$$\Bigl|\bigcup_{i=1}^{r} A_i\Bigr| = \sum_{i=1}^{r} |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{r+1} |A_1 \cap A_2 \cap \cdots \cap A_r|.$$

We shall apply it to the sets

$$A_k := \{\text{integers divisible by } p_k\} \quad (k = 1,\ldots,r).$$

In particular $A_1 \cup A_2 \cup \cdots \cup A_r$ is the set of integers divisible by at least one
of the integers $p_1,\ldots,p_r$ and $\overline{A_1 \cup A_2 \cup \cdots \cup A_r}$ is the set of integers that are not
divisible by any of the $p_1,\ldots,p_r$, that is the set of integers that are prime with $n$.
Therefore,

$$|A_1 \cup A_2 \cup \cdots \cup A_r| = n - \varphi(n).$$


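Applying Poincaré’s formula, and noting that

$$|A_i| = \frac{n}{p_i}, \quad |A_i \cap A_j| = \frac{n}{p_i p_j}\ (i<j), \quad \ldots, \quad |A_1 \cap A_2 \cap \cdots \cap A_r| = \frac{n}{p_1 \cdots p_r},$$

we obtain that

$$n - \varphi(n) = \sum_{i} \frac{n}{p_i} - \sum_{i<j} \frac{n}{p_i p_j} + \cdots + (-1)^{r+1} \frac{n}{p_1 \cdots p_r}.$$

Therefore

$$\varphi(n) = n - \sum_{i} \frac{n}{p_i} + \sum_{i<j} \frac{n}{p_i p_j} - \cdots + (-1)^{r} \frac{n}{p_1 \cdots p_r} = n \prod_{i=1}^{r}\Bigl(1 - \frac{1}{p_i}\Bigr),$$

which is Euler’s formula.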
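
Euler’s formula lends itself to a quick numerical check. The sketch below compares the product over the prime divisors with a direct count of the integers in {1, ..., n} that are prime with n; the helper names are illustrative and the trial-division factorization is only meant for small n.

```python
from math import gcd


def prime_divisors(n):
    """Distinct prime divisors of n, by trial division."""
    primes, d = [], 2
    while d * d <= n:
        if n % d == 0:
            primes.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        primes.append(n)
    return primes


def phi_by_product(n):
    """n times the product of (1 - 1/p) over prime divisors p, in integer arithmetic."""
    result = n
    for p in prime_divisors(n):
        result = result // p * (p - 1)
    return result


def phi_by_counting(n):
    """Number of k in {1, ..., n} with gcd(k, n) = 1."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


print(phi_by_product(12), phi_by_counting(12))
```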


The next example formalizes the inebriated postman problem, in which letters are
randomly distributed in the mailboxes.


EXAMPLE 1.4.4: COINCIDENCES. An urn contains $n$ balls, each one with a
different number from 1 to $n$. One draws the balls, one by one, in succession
and without replacement. Each time a ball is drawn, one notes its number. The
randomness of the procedure is formalized by a random permutation $\sigma$ of the set
$\{1,2,\ldots,n\}$, all the permutations being equiprobable, that is, $\sigma_0$ being a given
permutation,

$$P(\sigma = \sigma_0) = \frac{1}{n!}.$$

One says that there is a coincidence at the $i$-th sample if $\sigma(i) = i$, and we denote
by $E_i$ the corresponding event. We shall compute the probability that there is at
least one coincidence occurring in the sequence of successive drawings. This event
is

$$A := E_1 \cup E_2 \cup \cdots \cup E_n.$$

For any indices $i_1 < i_2 < \cdots < i_k$,

$$P(E_{i_1} \cap \cdots \cap E_{i_k}) = \frac{(n-k)!}{n!},$$

so that, by Poincaré’s formula

$$P(A) = \sum_{k=1}^{n} (-1)^{k+1}\binom{n}{k}\frac{(n-k)!}{n!} = \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k!}.$$

This quantity tends to $1 - e^{-1} \approx 0.63212$ as $n \uparrow \infty$.
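
For small n the probability of at least one coincidence can be obtained by listing all permutations, which gives a way to check the alternating sum. A minimal sketch, with illustrative function names and positions indexed from 0 rather than 1:

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial


def p_coincidence_formula(n):
    """Alternating sum of 1/k! for k = 1, ..., n (Poincaré's formula)."""
    return sum(Fraction((-1) ** (k + 1), factorial(k)) for k in range(1, n + 1))


def p_coincidence_enumeration(n):
    """Fraction of permutations of {0, ..., n-1} having at least one fixed point."""
    perms = list(permutations(range(n)))
    hits = sum(1 for s in perms if any(s[i] == i for i in range(n)))
    return Fraction(hits, len(perms))


for n in range(1, 7):
    print(n, p_coincidence_formula(n), p_coincidence_enumeration(n))
print(float(p_coincidence_formula(10)), 1 - exp(-1))
```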


1.5 Exercises 


Exercise 1.5.1. DE MORGAN’S RULES 
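(a) Let $\{A_n\}_{n \ge 1}$ be an arbitrary sequence of subsets of $\Omega$. Prove De Morgan’s
identities:

$$\overline{\Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr)} = \bigcup_{n=1}^{\infty} \overline{A_n} \quad\text{and}\quad \overline{\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr)} = \bigcap_{n=1}^{\infty} \overline{A_n}.$$

(b) Prove that if $\mathcal{F}$ is a σ-field on $\Omega$, and if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does
their intersection $\bigcap_{n=1}^{\infty} A_n$.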


Exercise 1.5.2. FINITELY OFTEN, INFINITELY OFTEN 
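Let $\{A_n\}_{n \ge 1}$ be an arbitrary sequence of subsets of $\Omega$.


(a) Show that $\omega \in B := \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} \overline{A_k}$ if and only if there exists at most a finite
number (depending on $\omega$) of indices $k$ such that $\omega \in A_k$.


(b) Show that $\omega \in D := \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$ if and only if there exist an infinite number
(depending on $\omega$) of indices $k$ such that $\omega \in A_k$.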


Exercise 1.5.3. INDICATOR FUNCTIONS 
Prove the following identities for all subsets A, B of a given set $\Omega$, and all sequences
$\{A_n\}_{n \ge 1}$ forming a partition of $\Omega$:

$$1_{A \cap B} = 1_A \times 1_B, \qquad 1_{\overline{A}} = 1 - 1_A, \qquad 1 = \sum_{n \ge 1} 1_{A_n}.$$


Exercise 1.5.4. UNION OF σ-FIELDS
Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be two σ-fields on the set $\Omega$. Give a counterexample contradicting
the assertion that $\mathcal{F}_1 \cup \mathcal{F}_2$ is a σ-field.


Exercise 1.5.5. WHY JUST σ-ADDITIVE?

Consider the probability model of Example 1.2.4 (random point on the square).
Prove that there exists no totally additive probability P on the Borel σ-field on
the square $[0,1]^2$ that assigns to rectangles therein their surface. (By “totally
additive”, it is meant that the probability of the union of an arbitrary, not
necessarily countable, collection of mutually disjoint sets in the Borel σ-field is
the sum of the individual probabilities.)


Exercise 1.5.6. IDENTITIES 
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let A and B be events ($\in \mathcal{F}$). Prove the
identities

$$P(A \cup B) = 1 - P(\overline{A} \cap \overline{B}), \qquad P(A \cup B) = P(A) + P(B) - P(A \cap B).$$


Exercise 1.5.7. SUB-σ-ADDITIVITY
Let $(\Omega, \mathcal{F}, P)$ be a probability space. Prove the sub-σ-additivity property: for any
sequence $\{A_n\}_{n \ge 1}$ of events,

$$P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) \le \sum_{n=1}^{\infty} P(A_n).$$


Exercise 1.5.8. SEQUENTIAL CONTINUITY, THE DECREASING CASE 
Prove Corollary 1.2.9. 


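Exercise 1.5.9. POINCARÉ’S FORMULA
Let P be a probability on some measurable space $(\Omega, \mathcal{F})$ and let $A_1,\ldots,A_n$ be
arbitrary events. Prove that

$$P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n). \tag{1.20}$$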


Exercise 1.5.10. ROLL IT! 
You roll fairly and simultaneously three unbiased dice. What is the probability 
that one die shows 4, another 2, and another 1?


Exercise 1.5.11. ONE IS THE SUM OF THE TWO OTHERS

You perform three independent tosses of an unbiased die. What is the probability 
that one of these tosses results in a number that is the sum of the two other 
numbers? 


Exercise 1.5.12. URNS 


1. An urn contains 17 red balls and 19 white balls. Balls are drawn in succession 
at random and without replacement. What is the probability that the first 2 balls 
are red? 


2. An urn contains N balls numbered from 1 to N. Someone draws n balls 
(1 <n < N) simultaneously from the urn. What is the probability that the lowest 
number drawn is k? 


Exercise 1.5.13. HEADS OR TAILS AS USUAL 

A person, A, tossing an unbiased coin $N$ times obtains $T_A$ tails. Another person,
B, tossing her own unbiased coin $N+1$ times has $T_B$ tails. What is the probability
that $T_A > T_B$?


Exercise 1.5.14. EXTENSION OF THE BASIC FORMULA OF INDEPENDENCE 
Let $\{C_n\}_{n \ge 1}$ be a sequence of independent events. Then

$$P\Bigl(\bigcap_{n=1}^{\infty} C_n\Bigr) = \prod_{n=1}^{\infty} P(C_n).$$

This extends formula (1.7) to a countable number of sets.


Exercise 1.5.15. THE SWITCHES 

Two nodes A and B in a communications network are connected by three different 
routes and each route contains a number of links that may fail. These are repre- 
sented symbolically in Fig. 1.2 by switches that are in the lifted position if the link 
is in a failure state. In this figure, the number associated with a switch is the prob- 
ability that the corresponding link is out of order. The links fail independently. 
What is the probability that A and B are connected? 


Exercise 1.5.16. PAIRWISE INDEPENDENCE DOES NOT SUFFICE 

Give a simple example of a probability space $(\Omega, \mathcal{F}, P)$ with three events $A_1, A_2, A_3$
that are pairwise independent, but not globally independent (that is, the family
$\{A_1, A_2, A_3\}$ is not independent).


Figure 1.2: All switches up.


Exercise 1.5.17. INDEPENDENT FAMILY OF EVENTS
If $\{A_i\}_{i \in I}$ is an independent family of events, is it true that $\{\tilde{A}_i\}_{i \in I}$ is also an
independent family of events, where for each $i \in I$, $\tilde{A}_i = A_i$ or $\overline{A_i}$ (your choice, for
instance, with $I = \mathbb{N}$, $\tilde{A}_0 = A_0$, $\tilde{A}_1 = \overline{A_1}$, $\tilde{A}_3 = A_3$, ...)?


Exercise 1.5.18. CONDITIONAL INDEPENDENCE AND THE MARKOV PROPERTY 
1. Let $(\Omega, \mathcal{F}, P)$ be a probability space. For a fixed event $C$ of positive probability
define $P_C(A) := P(A \mid C)$. Show that $P_C$ is a probability on $(\Omega, \mathcal{F})$. (And note
that A and B are independent with respect to this probability if and only if they
are conditionally independent given C.)


2. Let $A_1, A_2, A_3$ be three events of positive probability. Show that events $A_1$ and
$A_3$ are conditionally independent given $A_2$ if and only if the “Markov property”
holds, that is, $P(A_3 \mid A_1 \cap A_2) = P(A_3 \mid A_2)$.


Exercise 1.5.19. ROLL IT ONCE MORE! 


You roll fairly and simultaneously three unbiased dice. What is the probability 
that some die shows 1, given that the sum of the 3 values equals 5? 


Exercise 1.5.20. SOCIAL APARTHEID UNIVERSITY 


In the renowned Social Apartheid University, students have been separated into 
three social groups for “pedagogical” purposes. In group A, one finds students who 
individually have a probability of passing equal to 0.95. In group B this probability 
is 0.75, and in group C only 0.65. The three groups are of equal size. What is the 
probability that a student passing the course comes from group A? B? C? 


Exercise 1.5.21. WISE BET 

There are three cards. The first one has both faces red, the second one has both 
faces white, and the third one is white on one face, red on the other. A card is 
drawn at random, and the color of a randomly selected face of this card is shown 
to you (the other remains hidden). What is the winning strategy if you must bet 
on the color of the hidden face?


Exercise 1.5.22. A SEQUENCE OF LIARS

Consider a sequence of $n$ “liars” $L_1,\ldots,L_n$. The first liar $L_1$ receives information
about the occurrence of some event in the form “yes or no” and transmits it to $L_2$,
who transmits it to $L_3$, etc. Each liar transmits what he hears with probability
$p \in (0,1)$, and the contrary with probability $q = 1-p$. The decision of lying or
not is made by each liar independently of the rest of his colleagues. What is the
probability $x_n$ of obtaining the correct information from $L_n$? What is the limit of
$x_n$ as $n$ increases to infinity?


Exercise 1.5.23. THE CAMPUS LIBRARY COMPLAINT 

You are looking for a book in the campus libraries. Each library has it with 
probability 0.60 but in each library the book may have been stolen with probability 
0.25. If there are three libraries, what are your chances of obtaining the book? 


Exercise 1.5.24. SAFARI BUTCHERS 

Three tourists participate in a safari in Africa. They encounter an elephant, who is 
unaware of the rules of the game. The innocent beast is killed, having received two 
out of the three bullets simultaneously shot by the tourists. The hit probabilities 
of the tourists are: Tourist A: +, Tourist B: 3, Tourist C: 3. Give for each tourist 


4 
the probability that he was the one who missed. 


Exercise 1.5.25. THE HARDY-WEINBERG LAW 

In Example 1.3.6, show that the genotypic distributions of all generations, starting 
from the third one, are the same and that the stationary distribution depends only 
on the proportion c of alleles of type A in the initial population. 


Exercise 1.5.26. SLUMBERIDGE UNIVERSITY ALUMNI 

A student from the famous Veryhardvard University has with probability 0.25 a 
bright intelligence. Students from the Slumberidge University have a probability 
0.10 of being bright. You find yourself in an assembly with 10 Veryhardvard 
students and 20 Slumberidge University students. You meet a handsome girl 
(resp. boy) whose intelligence is obviously superior. What is the probability that 
she (resp. he) registered at Slumberidge University? 


Exercise 1.5.27. OPERATIONS ON EVENTS 
Let $\mathcal{F}$ be a σ-field on some set $\Omega$. Show that if $A_1$ and $A_2$ are in $\mathcal{F}$, then so is
their symmetric difference $A_1 \,\Delta\, A_2 := A_1 \cup A_2 - A_1 \cap A_2$.


Exercise 1.5.28. SMALL σ-FIELDS
Is there a σ-field on $\Omega$ with 6 elements (including of course $\Omega$ and $\emptyset$)?


Exercise 1.5.29. ATOMS
Let the non-empty subsets $A_1,\ldots,A_n$ of a set $\Omega$ form a partition of the latter.


(a) How many elements are there in the σ-field $\mathcal{F}$ they generate on $\Omega$? (The sets
$A_1,\ldots,A_n$ are called the atoms of $\mathcal{F}$.)


(b) Show that if a σ-field $\mathcal{F}$ on $\Omega$ contains a finite number of elements, it is
generated by a finite number of sets that form a partition of $\Omega$.


Exercise 1.5.30. LOST UMBRELLA 

With probability p € (0,1) the umbrella that you have lost is, equiprobably, in 
one of the seven floors of a given building. You have explored without success six 
floors. What is the probability that you will find your umbrella on the seventh 
floor? 


Exercise 1.5.31. ROLLING DICE 
Two (fair) dice are rolled independently in succession. Show that the event “the 
sum obtained is 7” is independent of the number shown by the first die. 


Exercise 1.5.32. THE FIVE COINS 
There are five fair coins, two have an A written on both faces, one has a B on both 
faces, and two have an A on one face and a B on the other face. 


(a) Someone picks a coin at random and tosses it. What is the probability that 
the lower face has an A on it? 


(b) Keeping your eyes shut, you pick a coin at random, take this coin into another 
room and toss it. You open your eyes and see that the upper face shows an A. 
What is the probability that the lower face has an A on it? 


Exercise 1.5.33. PROOF-READING 

A book contains four errors. Each time it is proof-read, a so far uncorrected 
error is corrected with probability :. The corrections of the different errors are 
independent. So are the proof-readings. How many proof-readings are necessary 
for the probability that no error is left to be larger than 0.9? (Hint: take for the 
probability space the set of 4-tuples $(a_1, a_2, a_3, a_4)$ of positive integers, where $a_i$ is
the number of proof-readings necessary to get rid of error $i$.)




Chapter 2 


Discrete Random Variables 


The number of heads in a sequence of 10,000 coin tosses, the number of days it 
takes until the next rain and the size of a genealogical tree are random numbers. 
All are functions of the outcome of a random experiment performed either by 
man or nature taking discrete values, that is, values in a countable set. In the 
above examples, the values are numbers, but they can be of a different nature, for 
instance graphs¹.


2.1 Probability Distribution and Expectation 


Definition 2.1.1 Let E be a countable set. A function $X : \Omega \to E$ such that for
all $x \in E$

$$\{\omega;\ X(\omega) = x\} \in \mathcal{F}$$

is called a discrete random variable or discrete random element.


Being in F, the event {X = x} can be assigned a probability. 


Definition 2.1.2 From the probabilistic point of view, a discrete random variable
X is described by its probability distribution function (or distribution, for short)
$\{\pi(x)\}_{x \in E}$, where

$$\pi(x) := P(X = x).$$

Since E is a countable set, it can always be identified with $\mathbb{N}$ or $\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$,
and therefore we shall often assume that either $E = \mathbb{N}$ or $\overline{\mathbb{N}}$.


¹ See for instance [6].

Calling a random variable taking integer values a “random number” is an
innocuous habit as long as one is aware that it is not the function X that is 
random, but the outcome w. This in turn makes the number X(w) random. 


One is sometimes faced with the problem of proving that a random variable X,
taking its values in $\overline{\mathbb{N}}$ (and therefore for which the value $\infty$ is a priori possible),
is in fact almost surely finite. That is, we have to prove that $P(X = \infty) = 0$ or,
equivalently, that $P(X < \infty) = 1$. Since $\{X < \infty\}$ is the disjoint union

$$\{X < \infty\} = \bigcup_{n=0}^{\infty} \{X = n\},$$

we have

$$P(X < \infty) = \sum_{n=0}^{\infty} P(X = n).$$

This remark provides an opportunity to recall that in an expression such as
$\sum_{n=0}^{\infty}$, the sum is over $\mathbb{N}$ and does not include $\infty$ as the notation wrongly suggests.
A less ambiguous notation would be $\sum_{n \in \mathbb{N}}$. However, we shall stick to the classical
notation and in the case where the summation is over all integers plus $\infty$, we shall
always use the notation $\sum_{n \in \overline{\mathbb{N}}}$.


In the vein of the above simple rules, let us mention the following often used
expression of $P(X < \infty)$ for an integer-valued random variable X:

$$P(X < \infty) = \lim_{n \uparrow \infty} P(X \le n).$$

For the proof (because it requires one), observe that the events $A_n := \{X \le n\}$
($n \ge 0$) form a non-decreasing sequence and that $\bigcup_n A_n = \{X < \infty\}$. The result
then follows by sequential continuity of probability (Theorem 1.2.8).


EXAMPLE 2.1.3: TOSSING A DIE, TAKE 3. In this example, the sample space
is $\Omega = \{1,2,3,4,5,6\}$. Take for X the identity: $X(\omega) = \omega$. In that sense X is a
random number obtained by tossing a die.


EXAMPLE 2.1.4: HEADS OR TAILS, TAKE 4. The sample space $\Omega$ is the collection
of all sequences $\omega := \{x_n\}_{n \ge 1}$, where $x_n = 1$ or 0. Define a random variable $X_n$
by $X_n(\omega) := x_n$. It is the random number obtained at the $n$-th toss. It is indeed a
random variable since for all $a_n \in \{0,1\}$, $\{\omega;\ X_n(\omega) = a_n\} = \{\omega;\ x_n = a_n\} \in \mathcal{F}$,
by definition of $\mathcal{F}$.


The following are elementary remarks.

Theorem 2.1.5 (a) Let E and F be countable sets. Let X be a random variable
with values in E, and let $f : E \to F$ be an arbitrary function. Then $Y := f(X)$ is
a random variable.


(b) Let $E_1$ and $E_2$ be countable sets. Let $X_1$ and $X_2$ be random variables with values
in $E_1$ and $E_2$ respectively. Then $Y := (X_1, X_2)$ is a random variable with values
in $E := E_1 \times E_2$.


Proof. (a) Let $y \in F$. The set $\{\omega;\ Y(\omega) = y\}$ is in $\mathcal{F}$ since it is a countable union
of sets in $\mathcal{F}$, namely:

$$\{Y = y\} = \bigcup_{x \in E;\ f(x) = y} \{X = x\}.$$

(b) Let $x = (x_1, x_2) \in E$. The set $\{\omega;\ Y(\omega) = x\}$ is in $\mathcal{F}$ since it is the intersection
of sets in $\mathcal{F}$, namely:

$$\{Y = x\} = \{X_1 = x_1\} \cap \{X_2 = x_2\}.$$


Independence and Conditional Independence 


Definition 2.1.6 Two discrete random elements X and Y taking their values in 
E and F respectively are called independent if 


$$P(X = i, Y = j) = P(X = i)\,P(Y = j) \quad (i \in E,\ j \in F). \tag{2.1}$$


The left-hand side of (2.1) is P({X =i}N{Y = j}). This is a general feature of 
the notational system: commas replace intersection signs. For instance, P(A, B) 
is the probability that both events A and B occur. 


Definition 2.1.7 The discrete random elements $X_1,\ldots,X_k$ taking their values in
$E_1,\ldots,E_k$ are said to be independent if for all $i_1 \in E_1,\ldots,i_k \in E_k$,

$$P(X_1 = i_1, \ldots, X_k = i_k) = P(X_1 = i_1) \cdots P(X_k = i_k). \tag{2.2}$$


Definition 2.1.8 A sequence $\{X_n\}_{n \ge 1}$ of discrete random elements indexed by the
set of positive integers and taking their values in the sets $\{E_n\}_{n \ge 1}$ respectively is
called an independent sequence if any finite collection of distinct random elements
$X_{i_1},\ldots,X_{i_r}$ extracted from this sequence are independent.

Definition 2.1.9 The sequence of discrete random elements $\{X_n\}_{n \ge 1}$ is said to be
an independent and identically distributed sequence (for short: an IID sequence)
if

(a) the $X_n$’s take their values in the same set E,
(b) the family $\{X_n\}_{n \ge 1}$ is independent, and
(c) the probability distribution function of $X_n$ does not depend on $n$.


EXAMPLE 2.1.10: HEADS OR TAILS, TAKE 5. We show that the sequence
$\{X_n\}_{n \ge 1}$ is IID. (Therefore, we have a model for independent tosses of an unbiased
coin.)


Proof. Event $\{X_k = a_k\}$ is the direct sum of the events $\{X_1 = a_1,\ldots,X_{k-1} = a_{k-1}, X_k = a_k\}$
for all possible values of $(a_1,\ldots,a_{k-1})$. Since there are $2^{k-1}$ such
values and each one has probability $2^{-k}$, we have $P(X_k = a_k) = 2^{k-1}2^{-k}$, that is,

$$P(X_k = a_k) = \frac{1}{2}.$$

Therefore, both sides being equal to $2^{-k}$,

$$P(X_1 = a_1,\ldots,X_k = a_k) = P(X_1 = a_1) \cdots P(X_k = a_k)$$

for all $a_1,\ldots,a_k \in \{0,1\}$, from which it follows by definition that $X_1,\ldots,X_k$ are
independent random variables, and more generally that $\{X_n\}_{n \ge 1}$ is a family of
independent random variables.


EXAMPLE 2.1.11: HEADS OR TAILS, TAKE 6. The number of occurrences of
heads in $n$ tosses is $S_n = X_1 + \cdots + X_n$. This random variable is the fortune at
time $n$ of a gambler systematically betting on heads. We compute its probability
distribution when the $X_n$’s are IID, but $P(X_n = 1) = p \in (0,1)$ (allowing for a
bias of the coin). The sum $S_n$ takes the integer values 0 to $n$. The event $\{S_n = k\}$
is “$k$ among $X_1,\ldots,X_n$ are equal to 1”. There are $\binom{n}{k}$ distinct ways of assigning $k$
values of 1 and $n-k$ values of 0 to $X_1,\ldots,X_n$, and all have the same probability
$p^k(1-p)^{n-k}$. Therefore

$$P(S_n = k) = \binom{n}{k}\,p^k(1-p)^{n-k}.$$

Definition 2.1.12 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be sequences of discrete random
elements indexed by the positive integers and taking their values in the sets $\{E_n\}_{n \ge 1}$
and $\{F_n\}_{n \ge 1}$ respectively. They are said to be independent sequences if for any finite
collection of random elements $X_{i_1},\ldots,X_{i_r}$ and $Y_{j_1},\ldots,Y_{j_s}$ extracted from their
respective sequences, the discrete random elements $(X_{i_1},\ldots,X_{i_r})$
and $(Y_{j_1},\ldots,Y_{j_s})$ are independent.
This means that

$$P\Bigl(\bigl(\cap_{\ell=1}^{r}\{X_{i_\ell} = a_\ell\}\bigr) \cap \bigl(\cap_{m=1}^{s}\{Y_{j_m} = b_m\}\bigr)\Bigr) = P\bigl(\cap_{\ell=1}^{r}\{X_{i_\ell} = a_\ell\}\bigr)\,P\bigl(\cap_{m=1}^{s}\{Y_{j_m} = b_m\}\bigr) \tag{2.3}$$

for all $a_1 \in E_{i_1},\ldots,a_r \in E_{i_r}$ and all $b_1 \in F_{j_1},\ldots,b_s \in F_{j_s}$.


The notion of conditional independence for events (Definition 1.3.10) extends 
naturally to discrete random variables. 


Definition 2.1.13 Let X, Y, Z be random variables taking their values in the 
denumerable sets E, F, G, respectively. One says that X and Y are conditionally
independent given Z if for all x, y, z in E, F, G, respectively, events {X = x} 
and {Y = y} are conditionally independent given {Z = z}. 


Theorem 2.1.14 Let X, Y, and Z be three discrete random variables with values
in E, F, and G, respectively. If for some function $g : E \times F \to [0,1]$,
$P(X = x \mid Y = y, Z = z) = g(x,y)$ for all $x, y, z$, then $P(X = x \mid Y = y) = g(x,y)$
for all $x, y$, and X and Z are conditionally independent given Y.


Proof. We have

$$P(X = x, Y = y) = \sum_{z} P(X = x, Y = y, Z = z) = \sum_{z} P(X = x \mid Y = y, Z = z)\,P(Y = y, Z = z) = g(x,y) \sum_{z} P(Y = y, Z = z) = g(x,y)\,P(Y = y).$$

Therefore,

$$P(X = x \mid Y = y) = g(x,y) = P(X = x \mid Y = y, Z = z).$$

Consequently, $P(X = x, Z = z \mid Y = y) = P(X = x \mid Y = y, Z = z)\,P(Z = z \mid Y = y) = P(X = x \mid Y = y)\,P(Z = z \mid Y = y)$, which is the announced conditional independence.

Expectation


Definition 2.1.15 Let X be a discrete random variable taking its values in the
countable set E and let $g : E \to \mathbb{R}$ be a function that is either non-negative or
such that

$$\sum_{x \in E} |g(x)|\,P(X = x) < \infty. \tag{2.4}$$

One then defines $E[g(X)]$, the expectation of $g(X)$, by the formula

$$E[g(X)] = \sum_{x \in E} g(x)\,P(X = x). \tag{2.5}$$

If the summability condition (2.4) is satisfied, the random variable g(X) is 
called integrable, and in this case the expectation E[g(X)] is a finite number. If g 
is only assumed non-negative, the expectation may be infinite. 


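EXAMPLE 2.1.16: HEADS OR TAILS, TAKE 7. Consider the random variable
$S_n = X_1 + \cdots + X_n$. We compute its expectation.

$$E[S_n] = \sum_{i=0}^{n} i\,P(S_n = i) = \sum_{i=1}^{n} i\binom{n}{i}p^i(1-p)^{n-i} = np \sum_{i=1}^{n} \frac{(n-1)!}{(i-1)!\,((n-1)-(i-1))!}\,p^{i-1}(1-p)^{(n-1)-(i-1)}.$$

Performing the change of variables $j = i - 1$, we obtain

$$E[S_n] = np \sum_{j=0}^{n-1} \frac{(n-1)!}{j!\,((n-1)-j)!}\,p^{j}(1-p)^{(n-1)-j} = np,$$

since, by the binomial formula (1.15), the last sum equals $(p + (1-p))^{n-1} = 1$.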
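
The binomial distribution of $S_n$ and the value $E[S_n] = np$ can both be checked by enumerating the $2^n$ outcomes of $n$ tosses. The sketch below uses exact fractions; the function names and the parameters $n = 5$, $p = 1/3$ are illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n, k, p):
    """P(S_n = k) by the closed-form binomial expression."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def pmf_by_enumeration(n, k, p):
    """P(S_n = k) by summing the probabilities of all 0/1 sequences with k ones."""
    total = Fraction(0)
    for tosses in product([0, 1], repeat=n):
        if sum(tosses) == k:
            total += p ** k * (1 - p) ** (n - k)
    return total


n, p = 5, Fraction(1, 3)
mean = sum(k * binomial_pmf(n, k, p) for k in range(n + 1))
print(mean, n * p)
```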


It is important to realize that a discrete random variable taking finite values 
may have an infinite expectation: 


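EXAMPLE 2.1.17: FINITE RANDOM VARIABLES WITH INFINITE EXPECTATIONS.
Let X, taking values in $E = \mathbb{N}$, have the distribution

$$P(X = n) = \frac{1}{c\,n^2} \quad (n \in \mathbb{N},\ n \ge 1),$$

where the constant $c$ is chosen such that

$$\sum_{n=1}^{\infty} P(X = n) = \sum_{n=1}^{\infty} \frac{1}{c\,n^2} = 1$$

(that is $c = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$). In fact, the expectation of X is

$$E[X] = \sum_{n=1}^{\infty} n\,P(X = n) = \sum_{n=1}^{\infty} \frac{n}{c\,n^2} = \sum_{n=1}^{\infty} \frac{1}{c\,n} = \infty.$$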
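
A numerical illustration of this phenomenon, offered as a floating-point sketch rather than a proof: with $c = \pi^2/6$, the partial sums of the probabilities approach 1, while the partial sums of $n\,P(X = n) = 1/(cn)$ keep growing, roughly like $(\log N)/c$. The function names are illustrative.

```python
from math import pi

C = pi ** 2 / 6  # normalizing constant: sum of 1/n^2 over n >= 1


def partial_mass(N):
    """Sum of P(X = n) for n = 1, ..., N."""
    return sum(1 / (C * n * n) for n in range(1, N + 1))


def partial_mean(N):
    """Sum of n * P(X = n) = 1 / (C n) for n = 1, ..., N."""
    return sum(1 / (C * n) for n in range(1, N + 1))


for N in (10 ** 2, 10 ** 4, 10 ** 5):
    print(N, partial_mass(N), partial_mean(N))
```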


The above example is artificial, but there are more natural occurrences of
the phenomenon. Consider for instance Example 2.1.11 (“heads or tails” with
an unbiased coin). The quantity $2S_n - n$ is the fortune at time $n$ of a gambler
systematically betting one bitcoin on heads (and therefore losing one bitcoin on
tails). Let $T$ be the first integer $n > 0$ (necessarily even) such that $2S_n - n = 0$.
Then, as we shall prove later in Example 9.1.29, $T$ is a finite
random variable with infinite expectation.


Theorem 2.1.18 Let A be some event. The expectation of the indicator random
variable $X = 1_A$ is

$$E[1_A] = P(A). \tag{2.6}$$

Proof. $X = 1_A$ takes the value 1 with probability $P(X = 1) = P(A)$ and the
value 0 with probability $P(X = 0) = P(\overline{A}) = 1 - P(A)$. Therefore,

$$E[X] = 0 \times P(X = 0) + 1 \times P(X = 1) = P(X = 1) = P(A).$$


Theorem 2.1.19 Let $g_1 : E \to \mathbb{R}$ and $g_2 : E \to \mathbb{R}$ be functions such that $g_1(X)$
and $g_2(X)$ are integrable (resp., non-negative), and let $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp., $\in \mathbb{R}_+$).
Expectation is linear:

$$E[\lambda_1 g_1(X) + \lambda_2 g_2(X)] = \lambda_1 E[g_1(X)] + \lambda_2 E[g_2(X)]. \tag{2.7}$$

Also, expectation is monotone, in the sense that $g_1(x) \le g_2(x)$ for all $x$ implies

$$E[g_1(X)] \le E[g_2(X)]. \tag{2.8}$$

Also, we have the triangle inequality

$$|E[g(X)]| \le E[|g(X)|]. \tag{2.9}$$


Proof. These properties follow from the corresponding properties of series. 


The next result gives an alternative way of computing the expectation of an
integer-valued random variable.

Theorem 2.1.20 For an integer-valued (that is, taking its values in $\mathbb{N}$) random
variable X, we have the telescope formula:

$$E[X] = \sum_{n=1}^{\infty} P(X \ge n).$$

Proof.

$$\begin{aligned}
E[X] &= P(X = 1) + 2P(X = 2) + 3P(X = 3) + \cdots \\
&= P(X = 1) + P(X = 2) + P(X = 3) + \cdots \\
&\quad + P(X = 2) + P(X = 3) + \cdots \\
&\quad + P(X = 3) + \cdots
\end{aligned}$$

and the successive rows sum to $P(X \ge 1)$, $P(X \ge 2)$, $P(X \ge 3)$, and so on.

Theorem 5.2.4 will generalize this formula. 
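
The telescope formula is easy to test on a distribution with finite support. The sketch below uses a binomial distribution with illustrative parameters ($n = 6$, $p = 1/4$) and exact fractions; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """List [P(X = 0), ..., P(X = n)] for a binomial random variable."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def mean_direct(pmf):
    """E[X] as the sum of k * P(X = k)."""
    return sum(k * q for k, q in enumerate(pmf))


def mean_telescope(pmf):
    """E[X] as the sum over k >= 1 of the tail probabilities P(X >= k)."""
    return sum(sum(pmf[k:]) for k in range(1, len(pmf)))


pmf = binomial_pmf(6, Fraction(1, 4))
print(mean_direct(pmf), mean_telescope(pmf))
```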


Definition 2.1.21 Let X be an integer-valued random variable such that $E[|X|] < \infty$.
Then X is said to be integrable. In this case (only in this case), one defines
the mean of X as the (finite) number

$$\mu = E[X] = \sum_{n=0}^{+\infty} n\,P(X = n).$$

From the inequality $|a| \le 1 + a^2$, true for all $a \in \mathbb{R}$, we have that $|X| \le 1 + X^2$,
and therefore, by the monotonicity and linearity properties, $E[|X|] \le 1 + E[X^2]$
(we also used the fact that $E[1] = 1$). Therefore if $E[X^2] < \infty$ (in which case
we say that X is square-integrable) then X is integrable. The following definition
then makes sense.


Definition 2.1.22 Let X be a square-integrable random variable. We then define
the variance $\sigma^2$ of X by

$$\sigma^2 = E[(X - \mu)^2] = \sum_{n=0}^{+\infty} (n - \mu)^2\,P(X = n).$$

The variance is also denoted by Var(X). From the linearity of expectation, it
follows that, with $m := E[X] = \mu$, $E[(X - m)^2] = E[X^2] - 2mE[X] + m^2$, that is,

$$\mathrm{Var}(X) = E[X^2] - m^2. \tag{2.10}$$

Markov’s Inequality


Theorem 2.1.23 Let Z be a non-negative discrete random variable and let $a$ be
a positive number. Then (Markov’s inequality):

$$P(Z \ge a) \le \frac{E[Z]}{a}.$$

Proof. Take expectations in the inequality $Z \ge a\,1_{\{Z \ge a\}}$.


Taking $Z = (X - m)^2$ in Markov’s inequality and $a = \varepsilon^2$ ($\varepsilon > 0$), we obtain Chebyshev’s
inequality:

$$P(|X - m| \ge \varepsilon) \le \frac{\mathrm{Var}(X)}{\varepsilon^2}.$$

The Markov inequality, its corollary the Chebyshev inequality and the upcom- 

ing Jensen’s inequality will apply in very general situations, as we shall see in the 
sequel. 


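EXAMPLE 2.1.24: BERNSTEIN'S POLYNOMIAL APPROXIMATION. A continuous
function f from [0,1] into $\mathbb{R}$ can be approximated by a polynomial. More precisely,
for all $x \in [0,1]$,
\[ f(x) = \lim_{n \uparrow \infty} P_n(x), \tag{$\star$} \]
where
\[ P_n(x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k}, \]
and the convergence in ($\star$) is uniform in [0,1]. This classical result of
analysis will now be proved using probabilistic arguments.

Let $S_n$ be a binomial random variable of size n and parameter x, that is, the
number of heads in n independent tosses of a coin with probability x of heads
(Example 2.1.11). Then
\[ E\!\left[f\!\left(\frac{S_n}{n}\right)\right] = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) P(S_n = k)
   = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k} = P_n(x). \]

The function f is continuous on the bounded interval [0,1] and therefore uniformly
continuous on this interval. Therefore to any $\varepsilon > 0$, one can associate a number
$\delta(\varepsilon) > 0$ such that if $|y - x| \le \delta(\varepsilon)$, then $|f(x) - f(y)| \le \varepsilon$. Being continuous on
[0,1], f is bounded on [0,1] by some finite number, say M. Now
\[
\begin{aligned}
|P_n(x) - f(x)| &= \left| E\!\left[ f\!\left(\tfrac{S_n}{n}\right) - f(x) \right] \right|
   \le E\!\left[ \left| f\!\left(\tfrac{S_n}{n}\right) - f(x) \right| \right] \\
 &= E\!\left[ \left| f\!\left(\tfrac{S_n}{n}\right) - f(x) \right| 1_A \right]
  + E\!\left[ \left| f\!\left(\tfrac{S_n}{n}\right) - f(x) \right| 1_{\bar A} \right],
\end{aligned}
\]
where $A := \{\omega;\ |S_n(\omega)/n - x| < \delta(\varepsilon)\}$. Since $|f(S_n/n) - f(x)|\,1_{\bar A} \le 2M\,1_{\bar A}$, we
have
\[ E\!\left[ \left| f\!\left(\tfrac{S_n}{n}\right) - f(x) \right| 1_{\bar A} \right] \le 2M P(\bar A)
   = 2M P\!\left( \left| \tfrac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right). \]
Also, by definition of A and of $\delta(\varepsilon)$,
\[ E\!\left[ \left| f\!\left(\tfrac{S_n}{n}\right) - f(x) \right| 1_A \right] \le \varepsilon. \]
Therefore
\[ |P_n(x) - f(x)| \le \varepsilon + 2M P\!\left( \left| \tfrac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right). \]
But x is the mean of $S_n/n$, and the variance of $S_n/n$ is $x(1-x)/n \le 1/(4n)$. Therefore,
by Chebyshev's inequality,
\[ P\!\left( \left| \tfrac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right) \le \frac{1}{4n\,\delta(\varepsilon)^2}. \]
Finally
\[ |f(x) - P_n(x)| \le \varepsilon + \frac{M}{2n\,\delta(\varepsilon)^2}, \]
and the right-hand side is at most $2\varepsilon$ as soon as $n \ge M/(2\varepsilon\,\delta(\varepsilon)^2)$.

Since $\varepsilon > 0$ is otherwise arbitrary, this suffices to prove the convergence in ($\star$).
The convergence is uniform since the right-hand side of the latter inequality does
not depend on $x \in [0,1]$.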
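
The identity $P_n(x) = E[f(S_n/n)]$ with $S_n \sim B(n, x)$ is easy to evaluate numerically.
The Python sketch below computes the Bernstein polynomial directly from that sum
and measures the maximal error on a grid. The test function $|x - 1/2|$, the grid
size and the values of n are assumptions chosen for illustration only.

```python
from math import comb

def bernstein(f, n, x):
    """P_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k), i.e. E[f(S_n/n)] for S_n ~ B(n, x)."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k) for k in range(n + 1))

def max_error(f, n, grid_size=101):
    """Largest |P_n(x) - f(x)| over an evenly spaced grid of [0, 1]."""
    grid = [j / (grid_size - 1) for j in range(grid_size)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)

def kink(x):
    # Illustrative continuous (but not differentiable) test function.
    return abs(x - 0.5)

errors = [max_error(kink, n) for n in (10, 40, 160)]
print(errors)
```

For $f(x) = x^2$ the sum can be checked by hand, since $E[(S_n/n)^2] = x^2 + x(1-x)/n$;
the error term $x(1-x)/n$ is at most $1/(4n)$ for every x, which illustrates the
uniformity of the convergence.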


Jensen’s Inequality 


This inequality concerns the expectation of convex functions of a random variable. 
We therefore start by recalling the definition of a convex function. Let I be an
interval of $\mathbb{R}$ (closed, open, semi-closed, infinite, etc.) with non-empty interior
(a, b). The function $\varphi : I \to \mathbb{R}$ is called a convex function if for all $x, y \in I$ and
all $0 \le \theta \le 1$,
\[ \varphi(\theta x + (1-\theta)y) \le \theta\varphi(x) + (1-\theta)\varphi(y). \]
If the inequality is strict for all $x \ne y$ and all $0 < \theta < 1$, the function $\varphi$ is said to
be strictly convex.

Theorem 2.1.25 Let I be as above and let $\varphi : I \to \mathbb{R}$ be a convex function. Let
X be an integrable discrete real-valued random variable such that $P(X \in I) = 1$.
Assume moreover that either $\varphi$ is non-negative, or that $\varphi(X)$ is integrable. Then
(Jensen's inequality)
\[ E[\varphi(X)] \ge \varphi(E[X]). \]


Proof. A convex function $\varphi$ has the property that for any $x_0 \in (a,b)$, there exists
a straight line $y = \alpha x + \beta$ passing through $(x_0, \varphi(x_0))$, that is,
\[ \varphi(x_0) = \alpha x_0 + \beta, \tag{$\star$} \]
and such that for all $x \in (a,b)$,
\[ \varphi(x) \ge \alpha x + \beta, \tag{$\star\star$} \]
where the inequality is strict for $x \ne x_0$ if $\varphi$ is strictly convex. (The parameters
$\alpha$ and $\beta$ may depend on $x_0$ and may not be unique.) Take $x_0 = E[X]$. In particular
$\varphi(E[X]) = \alpha E[X] + \beta$. By ($\star\star$), $\varphi(X) \ge \alpha X + \beta$, and taking expectations using
($\star$),
\[ E[\varphi(X)] \ge \alpha E[X] + \beta = \varphi(E[X]). \]

Moment Bounds


Theorem 2.1.26 (a) For any integer-valued random variable X, we have the first
moment bound
\[ P(X \ne 0) \le E[X]. \]

(b) For a square-integrable real-valued discrete random variable X, we have the
second moment bound
\[ P(X = 0) \le \frac{\mathrm{Var}(X)}{E[X]^2}. \]

Proof. (a) For the first moment bound,
\[
\begin{aligned}
P(X \ne 0) &= P(X=1) + P(X=2) + P(X=3) + \cdots \\
           &\le P(X=1) + 2P(X=2) + 3P(X=3) + \cdots = E[X].
\end{aligned}
\]

(b) Since the event $X = 0$ implies the event $|X - E[X]| \ge |E[X]|$,
\[ P(X = 0) \le P(|X - E[X]| \ge |E[X]|) \le \frac{\mathrm{Var}(X)}{E[X]^2}, \]
where the last inequality is Chebyshev's inequality (applied with $\varepsilon = |E[X]| > 0$).


Product Rule for Expectation 


The product formula for expectations featured in the next theorem will be met 
several times in this book and is in fact very general (see Theorem 5.4.4). 


Theorem 2.1.27 Let Y and Z be two independent discrete random elements with
values in the countable sets F and G respectively, and let $v : F \to \mathbb{R}$, $w : G \to \mathbb{R}$
be functions which are either non-negative, or such that v(Y) and w(Z) are both
integrable. Then
\[ E[v(Y)w(Z)] = E[v(Y)]\,E[w(Z)]. \]

Proof. Consider the discrete random element X with values in $E = F \times G$ defined
by $X = (Y, Z)$, and let the function $g : E \to \mathbb{R}$ be defined by $g(x) = v(y)w(z)$
($x = (y, z)$). We have, under the prevailing conditions,
\[
\begin{aligned}
E[v(Y)w(Z)] = E[g(X)] &= \sum_{x \in E} g(x) P(X = x) \\
 &= \sum_{y \in F} \sum_{z \in G} v(y)w(z) P(Y = y, Z = z) \\
 &= \sum_{y \in F} \sum_{z \in G} v(y)w(z) P(Y = y) P(Z = z) \\
 &= \Big( \sum_{y \in F} v(y) P(Y = y) \Big) \Big( \sum_{z \in G} w(z) P(Z = z) \Big) \\
 &= E[v(Y)]\,E[w(Z)].
\end{aligned}
\]


The following consequence of the product rule is extremely important. It says 
that for independent random variables, “variances add up”. 


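Theorem 2.1.28 Let $X_1, \ldots, X_n$ be independent square-integrable discrete random
variables with real values. Then
\[ \sigma^2_{X_1 + \cdots + X_n} = \sigma^2_{X_1} + \cdots + \sigma^2_{X_n}. \tag{2.11} \]

Proof. Let $\mu_1, \ldots, \mu_n$ be the respective means of $X_1, \ldots, X_n$. The mean of the
sum $X := X_1 + \cdots + X_n$ is $\mu := \mu_1 + \cdots + \mu_n$. If $i \ne k$, we have, by the product
formula for expectations,
\[ E[(X_i - \mu_i)(X_k - \mu_k)] = E[X_i - \mu_i]\,E[X_k - \mu_k] = 0. \]
Therefore
\[
\begin{aligned}
\mathrm{Var}(X) = E[(X - \mu)^2] &= E\Big[ \Big( \sum_{i=1}^n (X_i - \mu_i) \Big)^2 \Big] \\
 &= E\Big[ \sum_{i=1}^n \sum_{k=1}^n (X_i - \mu_i)(X_k - \mu_k) \Big] \\
 &= \sum_{i=1}^n \sum_{k=1}^n E[(X_i - \mu_i)(X_k - \mu_k)] \\
 &= \sum_{i=1}^n E[(X_i - \mu_i)^2] = \sum_{i=1}^n \sigma^2_{X_i}.
\end{aligned}
\]

Note that means always add up, even when the random variables are not
independent.

Let X be a square-integrable discrete random variable. Then, clearly, for any
$a \in \mathbb{R}$, aX is square-integrable and its variance is given by the formula
\[ \mathrm{Var}(aX) = a^2\,\mathrm{Var}(X). \]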


EXAMPLE 2.1.29: VARIANCE OF THE EMPIRICAL MEAN. From the above
remark and Theorem 2.1.28, it follows that if $X_1, \ldots, X_n$ are independent and
identically distributed square-integrable random variables with real values and
common variance $\sigma^2$, then
\[ \mathrm{Var}\left( \frac{X_1 + \cdots + X_n}{n} \right) = \frac{\sigma^2}{n}. \]


EXAMPLE 2.1.30: THE WEAK LAW OF LARGE NUMBERS. Let $\{X_n\}_{n \ge 1}$ be
an independent sequence of real-valued discrete random variables with the same
probability distribution, mean (supposed well defined) m and variance $\sigma^2 < \infty$.
Then, since the variance of the n-th order empirical mean $\bar X_n := \frac{X_1 + \cdots + X_n}{n}$ is equal
to $\frac{\sigma^2}{n}$, we have by Chebyshev's inequality, for all $\varepsilon > 0$,
\[ P(|\bar X_n - m| \ge \varepsilon) = P\left( \left| \frac{\sum_{i=1}^n (X_i - m)}{n} \right| \ge \varepsilon \right) \le \frac{\sigma^2}{n\varepsilon^2}. \]
Therefore the empirical mean $\bar X_n$ converges to the mean m in probability, which
means exactly (by definition of the convergence in probability) that, for all $\varepsilon > 0$,
\[ \lim_{n \uparrow \infty} P(|\bar X_n - m| \ge \varepsilon) = 0. \]

This result is called the weak law of large numbers in order to distinguish it
from a much more powerful result, the strong law of large numbers in Chapter 6.

2.2 Remarkable Discrete Distributions


Uniform

Let $\mathcal{X}$ be a finite set whose cardinality is denoted by $|\mathcal{X}|$. A random variable X
with values in this set and having the distribution
\[ P(X = x) = \frac{1}{|\mathcal{X}|} \qquad (x \in \mathcal{X}) \]
is said to be uniformly distributed (or to have the uniform distribution) on $\mathcal{X}$.


EXAMPLE 2.2.1: IS THIS NUMBER THE LARGER ONE? Let a and b be two
numbers in {1, 2, ..., 10,000}, with a > b. Only one of these numbers is shown to
you, chosen at random and equiprobably. Call X this (now random) number. Is
there a good strategy for guessing if the number shown to you is the larger of the
two? Of course, we would like to have a probability of success strictly larger than
1/2. Perhaps surprisingly, there is such a strategy, which we now describe. Select at
random, uniformly on {1, 2, ..., 10,000} and independently of X, a number Y. If
$X \ge Y$, say that X is the larger number (= a), otherwise say that it is the smaller.

Let us compute the probability $P_E$ of a wrong guess. An error occurs when
either (i) $X \ge Y$ and $X = b$, or (ii) $X < Y$ and $X = a$. These events are exclusive
of one another, and therefore
\[
\begin{aligned}
P_E &= P(X \ge Y, X = b) + P(X < Y, X = a) \\
    &= P(b \ge Y, X = b) + P(a < Y, X = a) \\
    &= P(b \ge Y)P(X = b) + P(a < Y)P(X = a) \\
    &= P(b \ge Y)\tfrac12 + P(a < Y)\tfrac12 = \tfrac12 \big( P(b \ge Y) + P(a < Y) \big) \\
    &= \tfrac12 \big( 1 - P(Y \in [b+1, a]) \big) = \tfrac12 \Big( 1 - \frac{a-b}{10{,}000} \Big) < \tfrac12.
\end{aligned}
\]
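
The error probability of the threshold rule (say that X is the larger number when
$X \ge Y$) can be confirmed by exhaustive enumeration. The following Python sketch
counts the wrong guesses over all equally likely pairs (shown number, threshold);
the function name and the small test values are illustrative assumptions.

```python
from fractions import Fraction

def error_probability(a, b, n=10_000):
    """Exact probability of a wrong guess when a > b, X is a or b with
    probability 1/2 each, Y is uniform on {1, ..., n} and independent of X,
    and the guess is 'X is the larger number' if and only if X >= Y."""
    assert 1 <= b < a <= n
    wrong = 0
    for y in range(1, n + 1):
        for x in (a, b):
            guess_larger = x >= y
            is_larger = (x == a)
            if guess_larger != is_larger:
                wrong += 1
    # 2n equally likely (x, y) pairs in total.
    return Fraction(wrong, 2 * n)

print(error_probability(7, 3, n=10))        # small case, easy to check by hand
print(error_probability(6000, 2500))
```

The enumeration agrees with the closed form $\tfrac12 (1 - (a-b)/n)$: the further apart
a and b are, the better the strategy performs.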


Let $\{X_n\}_{n \ge 1}$ be an IID sequence of random variables taking their values in the
set {0, 1} and with a common distribution given by
\[ P(X_n = 1) = p \qquad (p \in (0,1)). \]

Define the Hamming weight h(a) of the binary vector $a = (a_1, a_2, \ldots, a_n) \in \{0,1\}^n$
by $h(a) := \sum_{i=1}^n a_i$. Since $P(X_i = a_i) = p$ or $1 - p$ depending on whether $a_i = 1$
or 0, and since there are exactly h(a) coordinates of a that are equal to 1,
\[ P(X_1 = a_1, \ldots, X_n = a_n) = p^{h(a)} q^{n - h(a)}, \tag{2.12} \]
where $q := 1 - p$.

Comparing with Examples 1.1.3 and 1.2.3, we see that we have modeled a
game of heads or tails, with a biased coin if $p \ne \frac12$. The sequence $\{X_n\}_{n \ge 1}$ is called
a Bernoulli sequence of parameter p.


EXAMPLE 2.2.2: THE GAMBLER'S RUIN. Two players A and B play "heads or
tails", where heads occur with probability $p \in (0,1)$ and the successive outcomes
form an IID sequence. Calling $X_n$ the fortune in dollars of player A at time n,
then $X_{n+1} = X_n + Z_{n+1}$, where $Z_{n+1} = +1$ (resp., $-1$) with probability p (resp.,
$q = 1 - p$), and $\{Z_n\}_{n \ge 1}$ is IID. In other words, A bets $1 on heads at each toss,
and B bets $1 on tails. The respective initial fortunes of A and B are a and b
(positive integers), and $c := a + b$ denotes the total fortune. The game ends when
a player is ruined. The duration of the game is T, the first time n at which $X_n = 0$
or c, and the probability of winning for A is $u(a) = P(X_T = c \mid X_0 = a)$. We shall
compute u(a).

[Figure: The gambler's ruin. A sample path of the fortune of A, ending with "A wins"
at level c = a + b at time T = 11.]

Instead of computing u(a) alone, we shall compute
\[ u(i) = P(X_T = c \mid X_0 = i) \]
for all states i ($0 \le i \le c$). For this, we first obtain a recurrence equation for u(i)
by breaking down event "A wins" according to what can happen after the first step
(the first toss) and using the Bayes rule of total causes. If $X_0 = i$ ($1 \le i \le c-1$),
then $X_1 = i+1$ (resp., $X_1 = i-1$) with probability p (resp., q), and the probability
of winning for A with updated initial fortune $i+1$ (resp., $i-1$) is $u(i+1)$ (resp.,
$u(i-1)$). Therefore, for i ($1 \le i \le c-1$),
\[ u(i) = p\,u(i+1) + q\,u(i-1), \]
with the boundary conditions $u(0) = 0$, $u(c) = 1$. The characteristic equation
associated with this linear recurrence equation is $pr^2 - r + q = 0$. It has two
distinct roots, $r_1 = 1$ and $r_2 = \frac{q}{p}$, if $p \ne q$, and a double root, $r_1 = 1$, if $p = q = \frac12$.

Therefore, the general solution is $u(i) = \lambda r_1^i + \mu r_2^i = \lambda + \mu \left( \frac{q}{p} \right)^i$ when $p \ne q$, and
$u(i) = \lambda r_1^i + \mu\, i\, r_1^i = \lambda + \mu i$ when $p = q = \frac12$. Taking into account the boundary
conditions, one can determine the values of $\lambda$ and $\mu$. The result is, for $p \ne q$,
\[ u(i) = \frac{1 - \left( \frac{q}{p} \right)^i}{1 - \left( \frac{q}{p} \right)^c}, \]
and for $p = q = \frac12$,
\[ u(i) = \frac{i}{c}. \]

In the case $p = q = \frac12$, the probability v(i) that B wins when the initial fortune of B
is $c - i$ is obtained by replacing i by $c - i$ in the expression for u(i): $v(i) = \frac{c-i}{c} = 1 - \frac{i}{c}$.
One checks that $u(i) + v(i) = 1$, which means in particular that the probability
that the game lasts forever is null. The reader is invited to check that the same is
true in the case $p \ne q$.
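
The closed-form solution $u(i) = (1 - (q/p)^i)/(1 - (q/p)^c)$ for $p \ne q$, and $u(i) = i/c$ for
$p = q = \frac12$, can be verified against the recurrence and its boundary conditions
with exact rational arithmetic. The Python sketch below does this; the function
names and the numerical values of c and p are assumptions made for illustration.

```python
from fractions import Fraction

def win_probability(i, c, p):
    """u(i): probability that A, starting from fortune i, reaches c before 0."""
    q = 1 - p
    if p == q:
        return Fraction(i, c)
    r = q / p
    return (1 - r**i) / (1 - r**c)

def satisfies_recurrence(c, p):
    """Check u(0) = 0, u(c) = 1 and u(i) = p u(i+1) + q u(i-1) for 1 <= i <= c-1."""
    q = 1 - p
    u = [win_probability(i, c, p) for i in range(c + 1)]
    boundary = (u[0] == 0 and u[c] == 1)
    interior = all(u[i] == p * u[i + 1] + q * u[i - 1] for i in range(1, c))
    return boundary and interior

c, p = 10, Fraction(2, 5)
# B wins from fortune c - i in a game where B's favourable toss has probability q.
totals = [win_probability(i, c, p) + win_probability(c - i, c, 1 - p) for i in range(c + 1)]
print(satisfies_recurrence(c, p), satisfies_recurrence(c, Fraction(1, 2)), set(totals))
```

The last line also carries out the check left to the reader: for $p \ne q$ the two
winning probabilities again sum to 1, so the game ends with probability 1.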


The framework of heads or tails shelters the three most common discrete ran- 
dom variables: the binomial, the geometric and the Poisson random variables. 


Binomial 


Definition 2.2.3 A random variable X taking its values in the set E = {0, 1, ..., n}
and with the probability distribution
\[ P(X = i) = \binom{n}{i} p^i (1-p)^{n-i} \qquad (0 \le i \le n) \]
is called a binomial random variable of size n and parameter $p \in (0,1)$. This is
denoted by $X \sim B(n,p)$.

The mean and the variance of a binomial random variable X of size n and
parameter p are given by
\[ E[X] = np, \qquad \mathrm{Var}(X) = np(1-p). \]

Proof. In Example 2.1.11, it was proved that the number of occurrences of heads
in n tosses, $S_n = X_1 + \cdots + X_n$, is a binomial random variable B(n, p). We have
\[ E[S_n] = \sum_{i=1}^n E[X_i] = nE[X_1] \]
and since the $X_i$'s are IID,
\[ \mathrm{Var}(S_n) = \sum_{i=1}^n \mathrm{Var}(X_i) = n\,\mathrm{Var}(X_1). \]
Now
\[ E[X_1] = 0 \times P(X_1 = 0) + 1 \times P(X_1 = 1) = P(X_1 = 1) = p, \]
and since $X_1^2 = X_1$,
\[ E[X_1^2] = E[X_1] = p. \]
Therefore
\[ \mathrm{Var}(X_1) = E[X_1^2] - E[X_1]^2 = p - p^2 = p(1-p) \]
and
\[ \mathrm{Var}(S_n) = np(1-p). \]
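
The formulas $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$ can also be checked directly from the
binomial distribution, without the decomposition into Bernoulli variables. Here is
a short Python sketch with exact fractions; the values n = 12 and p = 1/3 are
arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P(X = i), i = 0..n, for X ~ B(n, p)."""
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def mean_and_variance(pmf):
    """Mean and variance of a distribution on {0, 1, ..., len(pmf) - 1}."""
    mean = sum(i * w for i, w in enumerate(pmf))
    second_moment = sum(i * i * w for i, w in enumerate(pmf))
    return mean, second_moment - mean**2

n, p = 12, Fraction(1, 3)
pmf = binomial_pmf(n, p)
mean, var = mean_and_variance(pmf)
print(sum(pmf), mean, var)
```

The variance is computed through formula (2.10), $\mathrm{Var}(X) = E[X^2] - E[X]^2$.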
Geometric 


Definition 2.2.4 A random variable X taking its values in $\mathbb{N}_+ := \{1, 2, \ldots\}$ and with
the distribution
\[ P(X = k) = (1-p)^{k-1} p \qquad (k \ge 1), \]
where $0 < p < 1$, is called a geometric random variable with parameter p. This is
denoted by $X \sim \mathrm{Geo}(p)$.

EXAMPLE 2.2.5: FIRST "HEADS" IN THE SEQUENCE. Define the random variable
T to be the first time of occurrence of 1 in the Bernoulli sequence $\{X_n\}_{n \ge 1}$,
that is,
\[ T = \inf\{n \ge 1;\ X_n = 1\}, \]
with the convention that if $X_n = 0$ for all $n \ge 1$, then $T = \infty$. The event $\{T = k\}$
is exactly $\{X_1 = 0, \ldots, X_{k-1} = 0, X_k = 1\}$, and therefore,
\[ P(T = k) = P(X_1 = 0) \cdots P(X_{k-1} = 0) P(X_k = 1), \]
that is, for $k \ge 1$,
\[ P(T = k) = (1-p)^{k-1} p. \]

The mean of a geometric random variable X with parameter p > 0 is
\[ E[X] = \frac{1}{p}. \tag{2.13} \]

Proof.
\[ E[X] = \sum_{k=1}^{\infty} k (1-p)^{k-1} p. \]
But for $\alpha \in (0,1)$,
\[ \sum_{k=1}^{\infty} k \alpha^{k-1} = \frac{d}{d\alpha} \Big( \sum_{k=1}^{\infty} \alpha^k \Big) = \frac{d}{d\alpha} \Big( \frac{1}{1-\alpha} - 1 \Big) = \frac{1}{(1-\alpha)^2}. \]
Therefore, with $\alpha = 1 - p$,
\[ E[X] = p \times \frac{1}{p^2} = \frac{1}{p}, \]
that is, finally, (2.13).


EXAMPLE 2.2.6: THE COUPON COLLECTOR. In a certain brand of chocolate
tablets one can find coupons, one for each tablet, randomly and independently
chosen among n types. A prize may be claimed once the chocolate lover has
gathered a collection containing all the types of coupons. We shall
compute the average value of the number X of chocolate tablets bought when this
happens for the first time. For this, let $X_i$ ($0 \le i \le n-1$) be the number of tablets
bought while exactly i coupons of different types have been collected, so that
\[ X = \sum_{i=0}^{n-1} X_i. \]
Each $X_i$ is a geometric random variable with parameter $p_i = 1 - \frac{i}{n}$. In particular
(by (2.13)),
\[ E[X_i] = \frac{1}{p_i} = \frac{n}{n-i}, \]
and therefore
\[ E[X] = \sum_{i=0}^{n-1} E[X_i] = n \sum_{i=1}^{n} \frac{1}{i}. \]

The sum $H(n) := \sum_{i=1}^n \frac{1}{i}$ (called the n-th harmonic number) satisfies the
inequalities
\[ \ln n \le H(n) \le \ln n + 1, \]
as can be seen by expressing $\ln n$ as the integral $\int_1^n \frac{1}{x}\,dx$, cutting the domain of
integration into segments of unit length, and using the fact that the integrand is
a decreasing function, which gives the inequalities
\[ \sum_{i=2}^{n} \frac{1}{i} \le \ln n \le \sum_{i=1}^{n-1} \frac{1}{i}, \]
that is
\[ H(n) - 1 \le \ln n \le H(n-1). \]
This gives the inequalities
\[ H(n) \le \ln n + 1 \]
and
\[ H(n) \ge \ln(n+1) \ge \ln n. \]

Therefore $H(n) = \ln n + O(1)$ (see footnote 2) and, finally,
\[ E[X] = n \ln n + O(n). \]

In fact, observing that $\left| \sum_{i=1}^n 1/i - \ln n \right| \le 1$, we have that $|E[X] - n \ln n| \le n$.
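
The expectation $E[X] = \sum_{i=0}^{n-1} 1/p_i = nH(n)$ and the bound $|E[X] - n \ln n| \le n$ are
easy to evaluate for a concrete number of coupon types. The Python sketch below
does so with exact fractions for the sums; the value n = 50 and the function names
are illustrative assumptions.

```python
from fractions import Fraction
from math import log

def expected_purchases(n):
    """E[X] = sum over i = 0..n-1 of 1/p_i, with p_i = 1 - i/n = (n - i)/n."""
    return sum(Fraction(n, n - i) for i in range(n))

def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

n = 50
ex = expected_purchases(n)
print(float(ex), n * log(n))
```

For two coupon types the expectation is 3: one tablet for the first type, then on
average two more for the missing one.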


Poisson

Definition 2.2.7 A random variable X taking its values in $\mathbb{N}$ and such that
\[ P(X = k) = e^{-\theta} \frac{\theta^k}{k!} \qquad (k \ge 0) \]
is called a Poisson random variable with parameter $\theta > 0$. This is denoted by
$X \sim \mathrm{Poi}(\theta)$.

2 $f(n) = O(g(n))$ means that there exist a positive real number M and an integer $n_0$ such
that $|f(n)| \le M|g(n)|$ for all $n \ge n_0$. This notation is part of the so-called Landau notational
system.

EXAMPLE 2.2.8: THE POISSON LAW OF RARE EVENTS, TAKE 1. A veterinary
surgeon in the Prussian cavalry once collected data concerning accidents due to
horse kicks among soldiers. He found that the (random) number of accidents
of this kind in a given year closely follows a Poisson distribution. The purpose of
this example is to explain why.

Suppose that you play "heads or tails" for a large number n of (independent)
tosses with a coin such that
\[ P(X_i = 1) = \frac{\alpha}{n}. \]
In the Prussian cavalry example, n is the (large) number of soldiers and $X_i = 1$ if
the i-th soldier has been hurt by a horse. Let $S_n$ be the total number of heads (of
wounded soldiers). We show that for all $k \ge 0$,
\[ \lim_{n \uparrow \infty} P(S_n = k) = e^{-\alpha} \frac{\alpha^k}{k!}, \tag{$\star$} \]
with the convention 0! = 1.

This explains the findings of the veterinary surgeon. The average number of
casualties is $\alpha$ and the choice $P(X_i = 1) = \frac{\alpha}{n}$ guarantees this. Letting $n \uparrow \infty$
accounts for n being large but unknown.

Here is the proof of the mathematical statement. As we know, the random
variable $S_n$ follows a binomial law:
\[ P(S_n = k) = \binom{n}{k} \left( \frac{\alpha}{n} \right)^k \left( 1 - \frac{\alpha}{n} \right)^{n-k} \]
of mean $n \times \frac{\alpha}{n} = \alpha$. With $p_n(k) := P(S_n = k)$, we see that
\[ \lim_{n \uparrow \infty} p_n(0) = \lim_{n \uparrow \infty} \left( 1 - \frac{\alpha}{n} \right)^n = e^{-\alpha} \]
and
\[ \frac{p_n(k+1)}{p_n(k)} = \frac{n-k}{k+1} \cdot \frac{\frac{\alpha}{n}}{1 - \frac{\alpha}{n}} \longrightarrow \frac{\alpha}{k+1} \qquad (n \uparrow \infty). \]
Therefore, by induction on k, ($\star$) holds for all $k \ge 0$, showing that the limit
distribution is indeed a Poisson distribution of mean $\alpha$.
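
The convergence of the binomial probabilities $\binom{n}{k} (\alpha/n)^k (1 - \alpha/n)^{n-k}$ to the Poisson
probabilities $e^{-\alpha} \alpha^k / k!$ can be observed numerically. The Python sketch below
compares the two for increasing n; the value $\alpha = 2$, the range of k and the
function names are assumptions made for the illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(n, k, alpha):
    """P(S_n = k) for S_n ~ B(n, alpha/n)."""
    p = alpha / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, alpha):
    """e^{-alpha} alpha^k / k!"""
    return exp(-alpha) * alpha**k / factorial(k)

def max_gap(n, alpha, k_max=10):
    """Largest pointwise gap between the two distributions for k = 0..k_max."""
    return max(abs(binomial_pmf(n, k, alpha) - poisson_pmf(k, alpha))
               for k in range(k_max + 1))

alpha = 2.0
gaps = [max_gap(n, alpha) for n in (10, 100, 1000, 10000)]
print(gaps)
```

The gap shrinks roughly like 1/n, consistent with the limit ($\star$) being approached
quickly once n is large compared with $\alpha$.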


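Theorem 2.2.9 For a Poisson random variable X with parameter $\theta > 0$,
\[ E[X] = \theta \quad \text{and} \quad \mathrm{Var}(X) = \theta. \]

Proof. The following is a direct computation. Later on we shall see a better
approach (via generating functions).
\[
E[X] = \sum_{k=0}^{\infty} k\, e^{-\theta} \frac{\theta^k}{k!}
     = e^{-\theta} \theta \sum_{k=1}^{\infty} \frac{\theta^{k-1}}{(k-1)!}
     = e^{-\theta} \theta \sum_{j=0}^{\infty} \frac{\theta^j}{j!}
     = e^{-\theta} \theta e^{\theta} = \theta,
\]
\[
\begin{aligned}
E[X^2 - X] &= \sum_{k=0}^{\infty} (k^2 - k)\, e^{-\theta} \frac{\theta^k}{k!}
            = e^{-\theta} \sum_{k=2}^{\infty} k(k-1) \frac{\theta^k}{k!} \\
           &= e^{-\theta} \theta^2 \sum_{k=2}^{\infty} \frac{\theta^{k-2}}{(k-2)!}
            = e^{-\theta} \theta^2 \sum_{j=0}^{\infty} \frac{\theta^j}{j!}
            = e^{-\theta} \theta^2 e^{\theta} = \theta^2,
\end{aligned}
\]
and therefore
\[ \mathrm{Var}(X) = E[X^2] - E[X]^2 = E[X^2 - X] + E[X] - E[X]^2 = \theta^2 + \theta - \theta^2 = \theta. \]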


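EXAMPLE 2.2.10: SUMS OF INDEPENDENT POISSON VARIABLES. Let $X_1$ and
$X_2$ be two independent Poisson random variables with respective means $\theta_1 > 0$ and
$\theta_2 > 0$. Then $X = X_1 + X_2$ is a Poisson random variable with mean $\theta = \theta_1 + \theta_2$.

Proof. For $k \ge 0$,
\[
\begin{aligned}
P(X = k) &= P(X_1 + X_2 = k) = P\Big( \bigcup_{i=0}^{k} \{X_1 = i, X_2 = k-i\} \Big) \\
 &= \sum_{i=0}^{k} P(X_1 = i, X_2 = k-i) = \sum_{i=0}^{k} P(X_1 = i) P(X_2 = k-i) \\
 &= \sum_{i=0}^{k} e^{-\theta_1} \frac{\theta_1^i}{i!}\, e^{-\theta_2} \frac{\theta_2^{k-i}}{(k-i)!}
  = \frac{e^{-(\theta_1 + \theta_2)}}{k!} \sum_{i=0}^{k} \frac{k!}{i!(k-i)!} \theta_1^i \theta_2^{k-i} \\
 &= e^{-(\theta_1 + \theta_2)} \frac{(\theta_1 + \theta_2)^k}{k!}.
\end{aligned}
\]

Hypergeometric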


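In Example 1.4.1 we met the following distribution (the number k of black balls
among n balls drawn at random, without replacement, from an urn containing N
black balls and M red balls):
\[ p_k = \frac{\binom{N}{k} \binom{M}{n-k}}{\binom{N+M}{n}} \qquad (0 \le k \le \inf(N, n)). \]
It is called the hypergeometric distribution.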


Multinomial 


Consider a random vector $X = (X_1, \ldots, X_n)$ where all the random variables $X_i$
take their values in the same denumerable space E (this restriction is not essential,
but it simplifies the notation). Let $p : E^n \to \mathbb{R}_+$ be a function such that
\[ \sum_{x \in E^n} p(x) = 1. \]

Definition 2.2.11 The discrete random vector X above is said to admit the
probability distribution p(x) ($x \in E^n$) if for all sets $C \subseteq E^n$,
\[ P(X \in C) = \sum_{x \in C} p(x). \]

In fact, there is nothing new here since X is a discrete random variable taking
its values in the denumerable set $\mathcal{X} := E^n$.

Consider the situation where k balls are placed in n boxes $B_1, \ldots, B_n$,
independently of one another, with the probability $p_i$ for a given ball to be assigned
to box $B_i$. Of course,
\[ \sum_{i=1}^n p_i = 1. \]
After placing all the balls in the boxes, there are $X_i$ balls in box $B_i$, where
\[ \sum_{i=1}^n X_i = k. \]
Then, for all non-negative integers $m_1, \ldots, m_n$ such that $m_1 + \cdots + m_n = k$,
\[ P(X_1 = m_1, \ldots, X_n = m_n) = \frac{k!}{\prod_{i=1}^n (m_i)!} \prod_{i=1}^n p_i^{m_i}. \tag{2.14} \]

Proof. Observe that ($\alpha$): there are $k!/\prod_{i=1}^n (m_i)!$ distinct ways of placing k balls
in n boxes in such a manner that $m_1$ balls are in box $B_1$, $m_2$ are in $B_2$, etc., and
($\beta$): each of these distinct ways occurs with the same probability $\prod_{i=1}^n p_i^{m_i}$.

Definition 2.2.12 If the random vector $X = (X_1, \ldots, X_n)$ admits the probability
distribution given by (2.14), it is called a multinomial (random) vector of size
(n, k) and parameters $p_1, \ldots, p_n$.


2.3. Generating Functions 


Computations relative to discrete probability models often require an enumeration 
of all the possible outcomes realizing a particular event. Generating functions are 
very useful for this task and, more generally, for obtaining distribution functions 
of integer-valued random variables. 


In order to introduce this versatile tool, we must first define the expectation
of a complex-valued function of an integer-valued variable. Let X be a discrete
random variable with values in $\mathbb{N}$, and let $\varphi : \mathbb{N} \to \mathbb{C}$ be a complex function with
real and imaginary parts $\varphi_R$ and $\varphi_I$ respectively. The expectation $E[\varphi(X)]$ is then
defined by
\[ E[\varphi(X)] := E[\varphi_R(X)] + iE[\varphi_I(X)], \]
provided that the expectations on the right-hand side are well defined and finite.

Definition 2.3.1 Let X be an $\mathbb{N}$-valued random variable. Its generating function
(GF) is the function $g : \bar D(0;1) := \{z \in \mathbb{C};\ |z| \le 1\} \to \mathbb{C}$ defined by
\[ g(z) = E[z^X] = \sum_{k=0}^{\infty} P(X = k) z^k. \tag{2.15} \]

Since $\sum_{n=0}^{\infty} P(X = n) = 1 < \infty$, the power series associated with the sequence
$\{P(X = n)\}_{n \ge 0}$ has a radius of convergence $R \ge 1$. The domain of definition of g
could be, in specific cases, larger than the closed unit disk centered at the origin.

In the next two examples, the domain of absolute convergence is the
whole complex plane.


EXAMPLE 2.3.2: GF OF THE BINOMIAL VARIABLE. For the binomial random
variable of size n and parameter p,
\[ g(z) = \sum_{k=0}^{n} \binom{n}{k} (zp)^k (1-p)^{n-k} = (1 - p + pz)^n. \]

EXAMPLE 2.3.3: GF OF THE POISSON VARIABLE. For the Poisson random
variable of mean $\theta$,
\[ g(z) = e^{-\theta} \sum_{k=0}^{\infty} \frac{(\theta z)^k}{k!} = e^{\theta(z-1)}. \]

Here is an example where the radius of convergence is finite.

EXAMPLE 2.3.4: GF OF THE GEOMETRIC VARIABLE. For the geometric random
variable of parameter $p \in (0,1)$,
\[ g(z) = \sum_{k=1}^{\infty} p(1-p)^{k-1} z^k = \frac{pz}{1 - (1-p)z}. \]
The radius of convergence of the above power series is $\frac{1}{1-p}$.

Theorem 2.3.5 The generating function characterizes the distribution of a
random variable.

This means the following. Suppose that, without knowing the distribution of
X, you have been able to compute its generating function g, and that, moreover,
you are able to give its power series expansion in a neighborhood of the origin
(see footnote 3):
\[ g(z) = \sum_{n=0}^{\infty} a_n z^n. \]
Since g is the generating function of X,
\[ g(z) = \sum_{n=0}^{\infty} P(X = n) z^n, \]
and since the power series expansion around the origin is unique, the distribution
of X is identified as
\[ P(X = n) = a_n \]
for all $n \ge 0$. Similarly, if two $\mathbb{N}$-valued random variables X and Y have the same
generating function, they have the same distribution. Indeed, the identity in a
neighborhood of the origin of the power series:
\[ \sum_{n=0}^{\infty} P(X = n) z^n = \sum_{n=0}^{\infty} P(Y = n) z^n \]
implies the identity of their coefficients.

3 This is a common situation; see Theorem 2.3.12 for instance.


Theorem 2.3.6 Let X and Y be two independent integer-valued random variables
with respective generating functions $g_X$ and $g_Y$. Then the sum X + Y
has the GF
\[ g_{X+Y}(z) = g_X(z) \times g_Y(z). \tag{2.16} \]

Proof. Use the product formula for expectations:
\[ g_{X+Y}(z) = E[z^{X+Y}] = E[z^X z^Y] = E[z^X]\,E[z^Y]. \]

EXAMPLE 2.3.7: SUM OF INDEPENDENT POISSON VARIABLES. Let X and Y
be two independent Poisson random variables of means $\alpha$ and $\beta$ respectively. We
shall prove that the sum X + Y is a Poisson random variable with mean $\alpha + \beta$. In
fact, according to (2.16),
\[ g_{X+Y}(z) = g_X(z) \times g_Y(z) = e^{\alpha(z-1)} e^{\beta(z-1)} = e^{(\alpha+\beta)(z-1)}, \]
and the assertion follows directly from Theorem 2.3.5 since $g_{X+Y}$ is the GF of a
Poisson random variable with mean $\alpha + \beta$.


Moments from the Generating Function 


Generating functions can be used to obtain the moments of a discrete random 
variable. 


Theorem 2.3.8 We have
\[ g'(1) = E[X] \tag{2.17} \]
and
\[ g''(1) = E[X(X-1)]. \tag{2.18} \]

Proof. Inside the open disk centered at the origin and of radius R, the power
series defining the generating function g is continuous, and differentiable at any
order term by term. In particular, differentiating twice both sides of (2.15) inside
the open disk D(0; R) gives
\[ g'(z) = \sum_{n=1}^{\infty} n P(X = n) z^{n-1} \tag{2.19} \]
and
\[ g''(z) = \sum_{n=2}^{\infty} n(n-1) P(X = n) z^{n-2}. \tag{2.20} \]
When the radius of convergence R is strictly larger than 1, we obtain the announced
results by letting z = 1 in the previous identities.

If R = 1, the same is basically true but the mathematical argument is more
subtle. The difficulty is not with the right-hand side of (2.19), which is always well
defined at z = 1, being equal to $\sum_{n=1}^{\infty} n P(X = n)$, a non-negative and possibly
infinite quantity. The difficulty is that g may not be differentiable at z = 1, a
border point of the disk (here of radius 1) on which it is defined. However, by
Abel's theorem (see footnote 4), the limit of $\sum_{n=1}^{\infty} n P(X = n) x^{n-1}$ as the real variable
x increases to 1 is $\sum_{n=1}^{\infty} n P(X = n)$. Therefore g', as a function defined on the real
interval [0,1), can be extended to [0,1] by (2.17), and this extension preserves
continuity. With this definition of g'(1), Formula (2.17) holds true. Similarly, when
R = 1, the function g'' defined on [0,1) by (2.20) is extended to a continuous
function on [0,1] by defining g''(1) by (2.18).


Another useful result is Wald's formula below, which gives the expectation of
a random sum of independent and identically distributed integer-valued variables.
By taking derivatives in (2.21) of Theorem 2.3.12, and using $g_Y(1) = 1$,
\[ E[X] = g_X'(1) = g_Y'(1)\, g_T'(g_Y(1)) = E[Y_1]\,E[T]. \]

A stronger version of this result is given in Exercise 2.5.18.


The next technical result gives details concerning the shape of the generating 
function restricted to the interval [0, 1]. 


Theorem 2.3.9 ($\alpha$) Let $g : [0,1] \to \mathbb{R}$ be defined by $g(x) = E[x^X]$, where X is a
non-negative integer-valued random variable. Then g is non-decreasing and convex.
Moreover, if $P(X = 0) < 1$, then g is strictly increasing, and if $P(X \le 1) < 1$, it
is strictly convex.

($\beta$) Suppose $P(X \le 1) < 1$. If $E[X] \le 1$, the equation $x = g(x)$ has a unique
solution $x \in [0,1]$, namely x = 1. If $E[X] > 1$, it has two solutions in [0,1], x = 1
and $x = x_0 \in (0,1)$.

4 Let $\{a_n\}_{n \ge 0}$ be a sequence of real numbers such that the radius of convergence of the
power series $\sum_{n=0}^{\infty} a_n z^n$ is 1. Suppose that the sum $\sum_{n=0}^{\infty} a_n$ is convergent. Then the power
series $\sum_{n=0}^{\infty} a_n x^n$ is uniformly convergent in [0,1] and
$\lim_{x \uparrow 1} \sum_{n=0}^{\infty} a_n x^n = \sum_{n=0}^{\infty} a_n$,
where $x \uparrow 1$ means that x tends to 1 strictly from below.

Proof. Just observe that for $x \in [0,1]$,
\[ g'(x) = \sum_{n=1}^{\infty} n P(X = n) x^{n-1} \ge 0, \]
and therefore g is non-decreasing, and
\[ g''(x) = \sum_{n=2}^{\infty} n(n-1) P(X = n) x^{n-2} \ge 0, \]
and therefore g is convex. For g'(x) to be null for some $x \in (0,1)$, it is necessary
to have $P(X = n) = 0$ for all $n \ge 1$, and therefore $P(X = 0) = 1$. For g''(x) to be
null for some $x \in (0,1)$, one must have $P(X = n) = 0$ for all $n \ge 2$, and therefore
$P(X = 0) + P(X = 1) = 1$.

[Figure: Two aspects of the generating function. Left panel: $E[X] \le 1$; right panel: $E[X] > 1$.]

The graph of $g : [0,1] \to \mathbb{R}$ has, in the strictly increasing strictly convex case
$P(X = 0) + P(X = 1) < 1$, the general shape shown in the figure, where we
distinguish two cases: $E[X] = g'(1) \le 1$, and $E[X] = g'(1) > 1$. The rest of the
proof is then easy.


The next two examples are typical of the use of generating functions in com- 
binatorics. 


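EXAMPLE 2.3.10: THE LOTTERY. We compute the probability that in a 6-digit
lottery the sum of the first three digits equals the sum of the last three digits.
(The digits from 0 to 9 are drawn equiprobably and independently and the result
is presented in the order they appear.)

Let $X_1, X_2, X_3, X_4, X_5$, and $X_6$ be independent random variables uniformly
distributed over {0, 1, ..., 9}. We first compute the generating function of
$Y = 27 + X_1 + X_2 + X_3 - X_4 - X_5 - X_6$. We have
\[ E[z^{X_i}] = \frac{1}{10} (1 + z + \cdots + z^9) = \frac{1}{10} \frac{1 - z^{10}}{1 - z}, \]
\[ E[z^{-X_i}] = \frac{1}{10} \Big( 1 + \frac{1}{z} + \cdots + \frac{1}{z^9} \Big) = \frac{1}{10} \frac{1}{z^9} \frac{1 - z^{10}}{1 - z}, \]
and
\[
E[z^Y] = E\Big[ z^{27 + \sum_{i=1}^{3} X_i - \sum_{i=4}^{6} X_i} \Big]
       = E\Big[ z^{27} \prod_{i=1}^{3} z^{X_i} \prod_{i=4}^{6} z^{-X_i} \Big]
       = z^{27} \prod_{i=1}^{3} E[z^{X_i}] \prod_{i=4}^{6} E[z^{-X_i}].
\]
Therefore,
\[ g_Y(z) = \frac{1}{10^6} \frac{(1 - z^{10})^6}{(1 - z)^6}. \]

But $P(X_1 + X_2 + X_3 = X_4 + X_5 + X_6) = P(Y = 27)$ is the coefficient of $z^{27}$ in the
power series expansion around the origin of $g_Y$. Since
\[ (1 - z^{10})^6 = 1 - \binom{6}{1} z^{10} + \binom{6}{2} z^{20} - \cdots \]
and
\[ (1 - z)^{-6} = 1 + \binom{6}{5} z + \binom{7}{5} z^2 + \binom{8}{5} z^3 + \cdots \]
(recall the negative binomial formula (1.18):
\[ (1 - z)^{-p} = 1 + \binom{p}{1} z + \binom{p+1}{2} z^2 + \binom{p+2}{3} z^3 + \cdots \; ), \]
we find that
\[ P(Y = 27) = \frac{1}{10^6} \left( \binom{32}{5} - \binom{6}{1} \binom{22}{5} + \binom{6}{2} \binom{12}{5} \right). \]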
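
The coefficient extraction can be cross-checked by brute force. Multiplying
generating functions amounts to convolving coefficient sequences, so the Python
sketch below builds the distribution of the sum of three digits by repeated
convolution and counts the favourable 6-digit outcomes. The function name is an
illustrative assumption.

```python
from math import comb

def digit_sum_counts(num_digits=3):
    """counts[s] = number of digit strings of the given length whose digits sum to s.
    Each pass multiplies the current polynomial by 1 + z + ... + z^9."""
    counts = [1]
    for _ in range(num_digits):
        new = [0] * (len(counts) + 9)
        for s, c in enumerate(counts):
            for d in range(10):
                new[s + d] += c
        counts = new
    return counts

counts = digit_sum_counts(3)
# The two halves are independent, so favourable outcomes = sum over s of counts[s]^2.
favourable = sum(c * c for c in counts)
formula = comb(32, 5) - comb(6, 1) * comb(22, 5) + comb(6, 2) * comb(12, 5)
probability = favourable / 10**6
print(favourable, formula, probability)
```

Both computations give 55,252 favourable outcomes out of $10^6$, that is, a probability
of about 0.0553.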
EXAMPLE 2.3.11: BIASED DICE. Do there exist two biased dice such that,
when tossed independently, the sum of their values is uniformly distributed on
{2, 3, ..., 12}? The answer is no.

To see this, let us call $X_1$ and $X_2$ the values obtained by tossing the first and
the second die respectively, and $g_1$ and $g_2$ the corresponding generating functions.
The generating function of $X := X_1 + X_2$ is $g(z) = g_1(z) \times g_2(z)$ since the dice
are supposed to be tossed independently. If the sum was uniformly distributed on
{2, 3, ..., 12}, then we would have
\[ g_1(z) \times g_2(z) = \frac{1}{11} (z^2 + z^3 + \cdots + z^{12}) = \frac{1}{11} z^2\, \frac{1 - z^{11}}{1 - z}. \]
Equivalently,
\[ P_1(z) P_2(z) = \frac{1}{11} (1 + z + \cdots + z^{10}), \tag{$\star$} \]
where the polynomials
\[ P_i(z) = \frac{g_i(z)}{z} = \sum_{k=0}^{5} P(X_i = k+1) z^k \qquad (i = 1, 2) \]
have common degree 5. Being of odd degree they each have at least one real root,
whereas the right-hand side of ($\star$) has no real roots (its roots are the ten eleventh
roots of unity not equal to 1). Hence a contradiction.


Random Sums 


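How to compute the distribution of random sums? Here again, generating
functions help.

Theorem 2.3.12 Let $\{Y_n\}_{n \ge 1}$ be an IID sequence of integer-valued random
variables with the common generating function $g_Y$. Let T be another random variable,
integer-valued, independent of the sequence $\{Y_n\}_{n \ge 1}$, and let $g_T$ be its generating
function. The generating function of
\[ X = \sum_{n=1}^{T} Y_n, \]
where by convention $\sum_{n=1}^{0} = 0$, is
\[ g_X(z) = g_T(g_Y(z)). \tag{2.21} \]

Proof. Since $\{\{T = k\}\}_{k \ge 0}$ is a sequence of mutually exclusive and exhaustive
subsets of $\Omega$,
\[ X = \sum_{k=0}^{\infty} \Big( \sum_{n=1}^{k} Y_n \Big) 1_{\{T = k\}} \]
and
\[ z^X = z^{\sum_{n=1}^{T} Y_n} = \sum_{k=0}^{\infty} z^{\sum_{n=1}^{T} Y_n}\, 1_{\{T = k\}}
       = \sum_{k=0}^{\infty} \Big( z^{\sum_{n=1}^{k} Y_n} \Big) 1_{\{T = k\}}. \]
Therefore,
\[ E[z^X] = \sum_{k=0}^{\infty} E\Big[ 1_{\{T = k\}}\, z^{\sum_{n=1}^{k} Y_n} \Big]
          = \sum_{k=0}^{\infty} E[1_{\{T = k\}}]\, E\Big[ z^{\sum_{n=1}^{k} Y_n} \Big], \]
where we have used the assumption of independence of T and $\{Y_n\}_{n \ge 1}$. Now,
$E[1_{\{T = k\}}] = P(T = k)$, and
\[ E\Big[ z^{\sum_{n=1}^{k} Y_n} \Big] = E\Big[ \prod_{n=1}^{k} z^{Y_n} \Big] = \prod_{n=1}^{k} E[z^{Y_n}] = g_Y(z)^k, \]
and therefore
\[ E[z^X] = \sum_{k=0}^{\infty} P(T = k)\, g_Y(z)^k = g_T(g_Y(z)). \]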
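EXAMPLE 2.3.13: THINNING OF A POISSON RANDOM VARIABLE. Let $\{X_n\}_{n \ge 1}$
be a Bernoulli sequence of parameter p, and let T be a Poisson random variable
with mean $\theta > 0$, independent of $\{X_n\}_{n \ge 1}$. We show that
\[ S := X_1 + \cdots + X_T \]
is a Poisson random variable with mean $p\theta$. (In other words, if one "thins out" with
thinning probability $1 - p$ a population sample of Poissonian size, the remaining
sample also has a Poissonian size, with the obvious mean, that is, p times that of
the original sample.) Indeed, in this case,
\[ g_T(z) = e^{\theta(z-1)} \]
and
\[ g_{X_1}(z) = pz + (1-p), \]
so that
\[ g_S(z) = g_T(g_{X_1}(z)) = e^{\theta(pz + 1 - p - 1)} = e^{p\theta(z-1)}. \]

Compare with a direct proof:
\[
\begin{aligned}
P(S = k) &= P(X_1 + \cdots + X_T = k) \\
 &= P\Big( \bigcup_{n=k}^{\infty} \{X_1 + \cdots + X_n = k,\ T = n\} \Big) \\
 &= \sum_{n=k}^{\infty} P(X_1 + \cdots + X_n = k,\ T = n) \\
 &= \sum_{n=k}^{\infty} P(X_1 + \cdots + X_n = k)\, P(T = n),
\end{aligned}
\]
that is, with $q := 1 - p$,
\[
\begin{aligned}
P(S = k) &= \sum_{n=k}^{\infty} \frac{n!}{k!(n-k)!}\, p^k q^{n-k}\, e^{-\theta} \frac{\theta^n}{n!} \\
 &= e^{-\theta} \frac{(p\theta)^k}{k!} \sum_{n=k}^{\infty} \frac{(q\theta)^{n-k}}{(n-k)!}
  = e^{-\theta} \frac{(p\theta)^k}{k!}\, e^{q\theta} = e^{-p\theta} \frac{(p\theta)^k}{k!}.
\end{aligned}
\]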

Branching Trees 


Francis Galton, a cousin of Darwin, was interested in the survival probability of 
a given line of English peerage. He posed the problem in the Educational Times 
in 1873. In the same year and the same journal, Watson proposed the method of 
solution that has become a textbook classic, and thereby initiated an important 
domain of probability called branching process theory. 


Here is the description of the Galton—Watson model, the statement of Galton’s 
purpose and Watson’s solution. 
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Let Z, = (Zh, Z,...), where the random variables A eee are IID and 
integer-valued. The recurrence equation 


em k 
ee Sree Ake (2.22) 


with the convention X,,,; = 0 if X, = 0, receives the following interpretation: X, 
is the number of individuals in the nth generation of a given population (humans, 
particles, etc.). Individual number k of the nth generation gives birth to FAG 
descendants, and this accounts for Eqn. (2.22). The number Xo of ancestors is 
assumed to be independent of {Z,}n>1. The sequence of random variables {X;,}n>0 
is called a branching process because of the genealogical tree that it generates (see 
Figure 2.1). The branching process is also known as the Galton—Watson process. 


Figure 2.1: Sample tree of a branching process 


With the purpose of obtaining the probability of extinction of the population, 
first observe that the event € = “an extinction occurs” is just “at least one gener- 
ation is empty”, that is, 


E =U {Xn = OF, 
In order to discard trivial cases, assume that P(Z = 0) < 1 and P(Z > 2) > 0. 
Let g be the common generating function of the variables ZM). Let 
Un(z) = Elz*] 


be the generating function of the number of individuals in the nth generation. We 
prove successively that 


(a) P(Anti = 0) = g(P(Xn = 0)), 
(b) P(E) = g(P(E)), and 
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(c) if E[Z] < 1 the probability of extinction is 1; and if E[Z,] > 1, the proba- 
bility of extinction is < 1 but nonzero. 


Proof. 


(a) In Equation (2.22), X,, is independent of the Ge Pe. Therefore, by Theorem 
2.3.12, 


Iterating this equality, we obtain wn41(z) = vo(g*(z)), where g™ is the nth 


iterate of g. If there is only one ancestor, then wo(z) = z, and therefore Wn41(z) = 
gt) (z) = g(g™(z)), that is, 


Yn+i(z) = 9nlz)) - 
In particular, since ¢),(0) = P(X, = 0), we have (a). 
(b) Since $X_n = 0$ implies $X_{n+1} = 0$, the family $\{X_n = 0\}$ is non-decreasing,
and by monotone sequential continuity,

$$P(\mathcal{E}) = \lim_{n \uparrow \infty} P(X_n = 0).$$


The generating function g is continuous, and therefore from (a) and the last equa- 
tion, the probability of extinction satisfies (b). 


(c) Let $Z$ be any of the random variables $Z_n^{(j)}$. Excluding the trivial cases
where $P(Z = 0) = 1$ or $P(Z \ge 2) = 0$, we have by Theorem 2.3.9 that:

($\alpha$) if $E[Z] \le 1$, the only solution of $x = g(x)$ in $[0,1]$ is 1, and therefore
$P(\mathcal{E}) = 1$: the branching process eventually becomes extinct, and

($\beta$) if $E[Z] > 1$, there are two solutions of $x = g(x)$ in $[0,1]$, namely 1 and $x_0$ such that
$0 < x_0 < 1$. From the strict convexity of $g : [0,1] \to [0,1]$, it follows that the
sequence $y_n = P(X_n = 0)$ that satisfies $y_0 = 0$ and $y_{n+1} = g(y_n)$ converges to
$x_0$. Therefore, when the mean number of descendants $E[Z]$ is strictly larger
than 1, $P(\mathcal{E}) \in (0,1)$.


EXAMPLE 2.3.14: EXTINCTION PROBABILITY FOR A POISSON OFFSPRING.
Take for the offspring distribution the Poisson distribution with mean $\lambda > 0$, whose
generating function is $g(x) = e^{\lambda(x-1)}$. Suppose that $\lambda > 1$ (the supercritical case).
The probability of extinction $P(\mathcal{E})$ is the unique solution in $(0,1)$ of $x = e^{\lambda(x-1)}$.
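
As a numerical illustration (a minimal sketch, not part of the original text), the fixed point can be approximated by iterating $y_{n+1} = g(y_n)$ from $y_0 = 0$, as in part (c) of the proof above. The function names, the tolerance and the iteration cap are assumptions made for this example.

```python
import math


def poisson_pgf(lam):
    """Return g(x) = exp(lam * (x - 1)), the generating function of Poisson(lam)."""
    return lambda x: math.exp(lam * (x - 1.0))


def extinction_probability(g, tol=1e-12, max_iter=100_000):
    """Iterate y_{n+1} = g(y_n) from y_0 = 0 until successive values differ by < tol.

    Since y_n = P(X_n = 0) is non-decreasing and converges to P(E), the
    iteration approaches the smallest solution of x = g(x) in [0, 1].
    """
    y = 0.0
    for _ in range(max_iter):
        y_next = g(y)
        if abs(y_next - y) < tol:
            return y_next
        y = y_next
    return y  # not converged within max_iter (convergence is slow near the critical case)


if __name__ == "__main__":
    for lam in (0.5, 2.0, 3.0):
        print(lam, extinction_probability(poisson_pgf(lam)))
```

For $\lambda = 2$ the iteration settles near 0.203, while for $\lambda = 0.5$ (subcritical) it approaches 1, in agreement with statement (c).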
EXAMPLE 2.3.15: EXTINCTION PROBABILITY FOR A BINOMIAL OFFSPRING.
Take for the offspring distribution the binomial distribution $B(N, p)$, with $0 < p < 1$.
Its mean is $m = Np$ and its generating function is $g(x) = (px + (1 - p))^N$.
Suppose that $Np > 1$ (the supercritical case). The probability of extinction $P(\mathcal{E})$
is the unique solution in $(0,1)$ of $x = (px + (1 - p))^N$.


EXAMPLE 2.3.16: POISSON BRANCHING AS THE LIMIT OF BINOMIAL BRANCHING.
Suppose now that $p = \frac{\lambda}{N}$ with $\lambda > 1$ (therefore we are in the supercritical
case). The probability of extinction is then given by the unique solution in $(0,1)$ of

$$x = \left(\frac{\lambda}{N} x + \left(1 - \frac{\lambda}{N}\right)\right)^N = \left(1 - \frac{\lambda(1 - x)}{N}\right)^N.$$

Letting $N \uparrow \infty$, we see that the right-hand side tends from below ($1 - x \le e^{-x}$) to
the generating function of a Poisson variable with mean $\lambda$. Using this fact and the
convexity of the generating functions, it follows that the probability of extinction
also tends to the probability of extinction relative to the Poisson distribution of
the offspring.
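
The convergence can be observed numerically. The sketch below (an illustration under stated assumptions, not the book's method) approximates both extinction probabilities by iterating the generating function from 0; the function names and the fixed number of iterations are choices made for this example.

```python
import math


def fixed_point_from_zero(g, n_iter=2000):
    """Approximate the smallest fixed point of g in [0, 1] by iterating from 0."""
    y = 0.0
    for _ in range(n_iter):
        y = g(y)
    return y


def binomial_extinction(N, lam):
    """Extinction probability for B(N, lam / N) offspring (supercritical if lam > 1)."""
    p = lam / N
    return fixed_point_from_zero(lambda x: (p * x + 1.0 - p) ** N)


def poisson_extinction(lam):
    """Extinction probability for Poisson(lam) offspring."""
    return fixed_point_from_zero(lambda x: math.exp(lam * (x - 1.0)))


if __name__ == "__main__":
    lam = 2.0
    for N in (4, 16, 256, 4096):
        print(N, binomial_extinction(N, lam))
    print("Poisson limit:", poisson_extinction(lam))
```

Because the binomial generating function lies below the Poisson one, the binomial extinction probabilities stay below the Poisson value and increase towards it as $N$ grows.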


Let $T$ be the extinction time of the Galton-Watson branching process. The
distribution of $T$ is fully described by

$$P(T \le n) = P(X_n = 0) = \psi_n(0) \qquad (n \ge 0)$$

and $P(T = \infty) = 1 - P(\mathcal{E})$. In particular

$$\lim_{n \uparrow \infty} P(T \le n) = P(\mathcal{E}). \qquad (\star)$$


Theorem 2.3.17 In the supercritical case ($m > 1$ and therefore $0 < P(\mathcal{E}) < 1$),

$$P(\mathcal{E}) - P(T \le n) \le \left(g'(P(\mathcal{E}))\right)^n. \qquad (2.23)$$

Proof. The probability of extinction $P(\mathcal{E})$ is the limit of the sequence $x_n =
P(X_n = 0)$ satisfying the recurrence equation $x_{n+1} = g(x_n)$ with initial value
$x_0 = 0$. We have that

$$0 \le P(\mathcal{E}) - x_{n+1} = P(\mathcal{E}) - g(x_n) = g(P(\mathcal{E})) - g(x_n),$$

that is,

$$\frac{P(\mathcal{E}) - x_{n+1}}{P(\mathcal{E}) - x_n} = \frac{g(P(\mathcal{E})) - g(x_n)}{P(\mathcal{E}) - x_n} \le g'(P(\mathcal{E})),$$

where we have taken into account the convexity of $g$ and the inequality $x_n < P(\mathcal{E})$.
The result follows from there by recurrence, since $P(\mathcal{E}) - x_0 = P(\mathcal{E}) \le 1$.


EXAMPLE 2.3.18: CONVERGENCE RATE FOR THE POISSON OFFSPRING DISTRIBUTION.
For a Poisson offspring with mean $m = \lambda > 1$, $g'(x) = \lambda g(x)$ and
therefore $g'(P(\mathcal{E})) = \lambda P(\mathcal{E})$. Therefore

$$P(\mathcal{E}) - P(T \le n) \le (\lambda P(\mathcal{E}))^n.$$
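
The bound can be checked numerically for a given $\lambda$. The following sketch (function name and iteration counts are assumptions made for the illustration) tabulates the gap $P(\mathcal{E}) - P(T \le n)$ next to the bound $(\lambda P(\mathcal{E}))^n$, using $P(T \le n) = x_n$ with $x_0 = 0$ and $x_{n+1} = g(x_n)$.

```python
import math


def extinction_gap_and_bound(lam, n_max=20, n_fix=5000):
    """Return P(E) and a list of rows (n, P(E) - P(T <= n), (lam * P(E))**n)."""
    g = lambda x: math.exp(lam * (x - 1.0))  # Poisson generating function

    # Approximate P(E) by a long iteration of x -> g(x) started at 0.
    p_ext = 0.0
    for _ in range(n_fix):
        p_ext = g(p_ext)

    rate = lam * p_ext  # g'(P(E)) for the Poisson offspring distribution
    rows = []
    x = 0.0  # x_0 = P(X_0 = 0) = 0 with a single ancestor
    for n in range(n_max + 1):
        rows.append((n, p_ext - x, rate ** n))
        x = g(x)
    return p_ext, rows


if __name__ == "__main__":
    p_ext, rows = extinction_gap_and_bound(2.0)
    for n, gap, bound in rows[:8]:
        print(n, gap, bound)
```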


EXAMPLE 2.3.19: CONVERGENCE RATE FOR THE BINOMIAL OFFSPRING DISTRIBUTION.
For a $B(N,p)$ offspring with mean $m = Np > 1$,

$$g'(x) = Np\, \frac{g(x)}{1 - p(1 - x)}$$

and therefore

$$g'(P(\mathcal{E})) = Np\, \frac{P(\mathcal{E})}{1 - p(1 - P(\mathcal{E}))}.$$

Taking $p = \frac{\lambda}{N}$,

$$g'(P_N(\mathcal{E})) = \lambda\, \frac{P_N(\mathcal{E})}{1 - \frac{\lambda}{N}(1 - P_N(\mathcal{E}))},$$

where the notation stresses the dependence of the extinction probability on $N$.


2.4 Conditional Expectation I 


Conditioning is the most important concept of probability theory after indepen- 
dence. We have already encountered this notion under the form of the conditional 
probability of an event given an event and the Bayes formulas. This book intro- 
duces the conditional expectation of a random variable given a random variable 
progressively, starting from the discrete case and then proceeding to the absolutely 
continuous case (Chapter 3), and finally giving the general theory in Chapter 5. 
We start with the notion of conditional expectation of a random variable given
an event. Let $Z$ be a discrete random variable with values in $E$, and let $f : E \to \mathbb{R}$
be a non-negative function. Let $A$ be some event of positive probability. The
conditional expectation of $f(Z)$ given $A$, denoted by $E[f(Z) \mid A]$, is by definition the
expectation when the distribution of $Z$ is replaced by its conditional distribution
given $A$:

$$E[f(Z) \mid A] = \sum_{z \in E} f(z) P(Z = z \mid A).$$

Let $\{A_i\}_{i \in \mathbb{N}}$ be a partition of the sample space. The following formula is then a
direct consequence of Bayes' formula of total causes:

$$E[f(Z)] = \sum_{i \in \mathbb{N}} E[f(Z) \mid A_i] P(A_i).$$


EXAMPLE 2.4.1: RANDOM QUICKSORT. We want to sort a sequence of numbers 
in increasing order, say 7,6,4,2,9,3,1,8,5. The quicksort algorithm proposes to 
choose one of these numbers at random, say 4, called the pivot. It then scans the 
list from left to right, comparing each number to the pivot, placing the ones that 
are smaller than the pivot to the left, the others to the right. This creates three 
sets: 


$$\{2, 3, 1\},\ 4,\ \{7, 6, 9, 8, 5\}$$


It operates likewise on the two subsets of size > 1 of this list. For instance, starting
with subset {2, 3, 1}, and choosing at random the pivot for this sublist, say 1, and
then continuing with the subset {7, 6, 9, 8, 5} with the pivot 7, we obtain:


1, {2, 3}, 4, {6, 5}, 7, {9, 8} 


Keep doing this until all the subsets have only one member. In this example just 
one more iteration is needed. The number of comparisons used in this specific 
example is 8 + (2+ 4) + (1+1+1) = 17. One would like to know how well this 
algorithm performs (in terms of the number of comparisons) in the general case. 
The ideal situation would be if at each splitting the median number is chosen, 
resulting in a number of comparisons approximately equal to 


$$n + 2 \times \frac{n}{2} + 4 \times \frac{n}{4} + \cdots = n \log_2 n,$$

where there are approximately $\log_2 n$ terms in the sum.

In the random quicksort algorithm, pivots are chosen randomly uniformly
among the existing possibilities. We will compare the average number of
comparisons in the random quicksort to the ideal $n \log_2 n$.


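Let $C_n$ be the number of comparisons needed and let $X$ be the rank of the
initial pivot selected. We have, with $M_n = E[C_n]$ (and $M_0 = 0$),

$$M_n = \sum_{j=1}^{n} E[C_n \mid X = j] P(X = j) = \sum_{j=1}^{n} (n - 1 + M_{j-1} + M_{n-j}) \times \frac{1}{n} = n - 1 + \frac{2}{n} \sum_{k=1}^{n-1} M_k,$$

and therefore

$$n M_n = n(n - 1) + 2 \sum_{k=1}^{n-1} M_k.$$

Subtracting the same expression with $n - 1$ instead of $n$, we have

$$n M_n = (n + 1) M_{n-1} + 2(n - 1),$$

or

$$\frac{M_n}{n+1} = \frac{M_{n-1}}{n} + \frac{2(n-1)}{n(n+1)}.$$

By iteration, since $M_1 = 0$,

$$\frac{M_n}{n+1} = \sum_{k=2}^{n} \frac{2(k-1)}{k(k+1)} = 2 \sum_{k=1}^{n} \frac{1}{k} - \frac{4n}{n+1},$$

and therefore, finally

$$M_n \sim 2 n \ln n.$$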
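
The recurrence for $M_n$ is easy to evaluate exactly. The sketch below (illustrative names, exact rational arithmetic) computes $M_n$ from $n M_n = n(n-1) + 2 \sum_{k=1}^{n-1} M_k$ and compares it with the closed form $2(n+1) H_n - 4n$, where $H_n = \sum_{k=1}^{n} 1/k$. This closed form is stated here as the solution of the recurrence and is checked by the code, not quoted from the text.

```python
from fractions import Fraction
import math


def mean_comparisons(n_max):
    """Return [M_0, ..., M_n_max] from M_n = n - 1 + (2/n) * sum_{k=1}^{n-1} M_k."""
    M = [Fraction(0)]          # M_0 = 0 by convention
    running_sum = Fraction(0)  # holds M_1 + ... + M_{n-1}
    for n in range(1, n_max + 1):
        M.append(Fraction(n - 1) + 2 * running_sum / n)
        running_sum += M[n]
    return M


def closed_form(n):
    """2 (n + 1) H_n - 4 n, with H_n the n-th harmonic number."""
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return 2 * (n + 1) * harmonic - 4 * n


if __name__ == "__main__":
    M = mean_comparisons(1000)
    for n in (10, 100, 1000):
        # The ratio tends to 1, but only slowly (the correction is of order 1 / ln n).
        print(n, float(M[n]), float(M[n]) / (2 * n * math.log(n)))
```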


The conditional expectation of a discrete random variable Z given another dis- 
crete random variable Y is the expectation of Z using the probability measure 
modified by the observation of Y. For instance, if Y = y, instead of the origi- 
nal probability assigning the mass P(A) to the event A, we use the conditional 
probability given Y = y assigning the mass P(A|Y = y) to this event. 


Definition 2.4.2 Let $X$ and $Y$ be two discrete random variables taking their
values in the denumerable sets $F$ and $G$, respectively, and let $g : F \times G \to \mathbb{R}$ be either
non-negative, or such that $E[|g(X,Y)|] < \infty$. For $y \in G$ such that $P(Y = y) > 0$,
let

$$\psi(y) = \sum_{x \in F} g(x, y) P(X = x \mid Y = y), \qquad (2.24)$$

and otherwise, if $P(Y = y) = 0$, let $\psi(y) = 0$. This quantity is called the
conditional expectation of $g(X,Y)$ given $Y = y$, and is denoted by $E^{Y=y}[g(X,Y)]$, or
$E[g(X,Y) \mid Y = y]$. The random variable $\psi(Y)$ is called the conditional
expectation of $g(X,Y)$ given $Y$, and is denoted by $E^{Y}[g(X,Y)]$ or $E[g(X,Y) \mid Y]$.
The sum in (2.24) is well defined (possibly infinite however) when $g$ is
non-negative. Note that in the non-negative case, we have that

$$\sum_{y \in G} \psi(y) P(Y = y) = \sum_{y \in G} \sum_{x \in F} g(x, y) P(X = x \mid Y = y) P(Y = y) = \sum_{x \in F} \sum_{y \in G} g(x, y) P(X = x, Y = y) = E[g(X,Y)].$$

In particular, if $E[g(X,Y)] < \infty$, then

$$\sum_{y \in G} \psi(y) P(Y = y) < \infty,$$

which implies that $\psi(y) < \infty$ for all $y \in G$ such that $P(Y = y) > 0$. We observe
(for reference in a few lines) that in this case, $\psi(Y) < \infty$ almost surely, that is to
say $P(\psi(Y) < \infty) = 1$ (in fact, $P(\psi(Y) = \infty) = \sum_{y;\, \psi(y) = \infty} P(Y = y) = 0$).

Let now $g : F \times G \to \mathbb{R}$ be a function of arbitrary sign such that $E[|g(X,Y)|] < \infty$,
and in particular $E[g^{\pm}(X,Y)] < \infty$. Denote by $\psi^{\pm}$ the functions associated
to $g^{\pm}$ as in (2.24). As we just saw, for all $y \in G$, $\psi^{\pm}(y) < \infty$, and therefore
$\psi(y) = \psi^{+}(y) - \psi^{-}(y)$ is well defined (not an indeterminate $\infty - \infty$ form). Thus,
the conditional expectation is well defined also in the integrable case. From the
observation made a few lines above, in this case, $|E^{Y}[g(X,Y)]| < \infty$.


EXAMPLE 2.4.3: BINOMIAL RANDOM VARIABLES AND CONDITIONING. Let $X_1$
and $X_2$ be independent binomial random variables of the same size $N$ and same
parameter $p$. We show that

$$E^{X_1+X_2}[X_1] = h(X_1 + X_2) = \frac{X_1 + X_2}{2}.$$

We have

$$P(X_1 = k \mid X_1 + X_2 = n) = \frac{P(X_1 = k) P(X_2 = n - k)}{P(X_1 + X_2 = n)} = \frac{\binom{N}{k} p^k (1-p)^{N-k} \binom{N}{n-k} p^{n-k} (1-p)^{N-n+k}}{\binom{2N}{n} p^n (1-p)^{2N-n}} = \frac{\binom{N}{k} \binom{N}{n-k}}{\binom{2N}{n}},$$

where we have used the fact that the sum of two independent binomial random
variables with size $N$ and parameter $p$ is a binomial random variable with size
$2N$ and parameter $p$. The right-hand side of the last display is the probability of
obtaining $k$ black balls when a sample of $n$ balls is randomly selected from an urn
containing $N$ black balls and $N$ red balls. This is the hypergeometric distribution.
The mean of such a distribution is (by symmetry) $\frac{n}{2}$, therefore

$$E[X_1 \mid X_1 + X_2 = n] = \frac{n}{2} = h(n)$$

and this gives the announced result.
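
The computation can be verified exactly for small parameters. In the following sketch (helper names are assumptions for the illustration), the conditional distribution is computed directly from the binomial probabilities with rational arithmetic, so that it can be compared with the hypergeometric formula without rounding.

```python
from fractions import Fraction
from math import comb


def binom_pmf(N, p, k):
    """P(X = k) for X binomial of size N and parameter p (0 outside 0..N)."""
    if k < 0 or k > N:
        return Fraction(0)
    return comb(N, k) * p**k * (1 - p) ** (N - k)


def conditional_pmf(N, p, n):
    """List of P(X1 = k | X1 + X2 = n) for k = 0..n, X1 and X2 IID binomial(N, p)."""
    joint = [binom_pmf(N, p, k) * binom_pmf(N, p, n - k) for k in range(n + 1)]
    total = sum(joint)  # equals P(X1 + X2 = n)
    return [w / total for w in joint]


if __name__ == "__main__":
    N, p, n = 5, Fraction(1, 3), 4
    pmf = conditional_pmf(N, p, n)
    print(pmf)                                     # does not depend on p
    print(sum(k * w for k, w in enumerate(pmf)))   # conditional mean
```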


Exercise 5.7.14 will give a more elegant solution to the above example, and the 
reader will discover there that the result is more general. 


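EXAMPLE 2.4.4: POISSON VARIABLES AND CONDITIONING. Let $X_1$ and $X_2$
be two independent Poisson random variables with respective means $\theta_1 > 0$ and
$\theta_2 > 0$. We compute $E^{X_1+X_2}[X_1]$, that is $E^{Y}[X]$, where $X := X_1$, $Y := X_1 + X_2$.

Following the instructions of Definition 2.4.2, we must first compute (only for
$y \ge x$, why?)

$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P(X_1 = x, X_1 + X_2 = y)}{P(X_1 + X_2 = y)} = \frac{P(X_1 = x, X_2 = y - x)}{P(X_1 + X_2 = y)} = \frac{P(X_1 = x) P(X_2 = y - x)}{P(X_1 + X_2 = y)}$$

$$= \frac{e^{-\theta_1} \frac{\theta_1^x}{x!}\, e^{-\theta_2} \frac{\theta_2^{y-x}}{(y-x)!}}{e^{-(\theta_1+\theta_2)} \frac{(\theta_1+\theta_2)^y}{y!}} = \binom{y}{x} \left(\frac{\theta_1}{\theta_1+\theta_2}\right)^x \left(\frac{\theta_2}{\theta_1+\theta_2}\right)^{y-x}.$$

Therefore, with $\alpha := \frac{\theta_1}{\theta_1+\theta_2}$,

$$\psi(y) = E^{Y=y}[X] = \sum_{x=0}^{y} x \binom{y}{x} \alpha^x (1 - \alpha)^{y-x} = \alpha y.$$

Finally, $E^{Y}[X] = \psi(Y) = \alpha Y$, that is,

$$E^{X_1+X_2}[X_1] = \frac{\theta_1}{\theta_1+\theta_2} (X_1 + X_2).$$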
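
The recipe of Definition 2.4.2 translates directly into a few lines of code. The sketch below (function names are assumptions for the illustration) evaluates $\psi(y)$ from formula (2.24) for the Poisson example; only the terms with $x \le y$ contribute, so the sum is finite.

```python
import math


def poisson_pmf(theta, k):
    """P(X = k) for a Poisson random variable X with mean theta."""
    return math.exp(-theta) * theta**k / math.factorial(k)


def cond_mean_x1_given_sum(theta1, theta2, y):
    """psi(y) = sum_x x P(X1 = x | X1 + X2 = y), computed as in (2.24)."""
    # P(X1 = x, X1 + X2 = y) = P(X1 = x) P(X2 = y - x) by independence.
    weights = [poisson_pmf(theta1, x) * poisson_pmf(theta2, y - x) for x in range(y + 1)]
    p_y = sum(weights)  # P(X1 + X2 = y), always positive here
    return sum(x * w for x, w in enumerate(weights)) / p_y


if __name__ == "__main__":
    theta1, theta2 = 1.5, 2.5
    alpha = theta1 / (theta1 + theta2)
    for y in range(6):
        print(y, cond_mean_x1_given_sum(theta1, theta2, y), alpha * y)
```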


We now give the main properties of conditional expectation: 


The first one, linearity, is obvious from the definitions: For all $\lambda_1, \lambda_2 \in \mathbb{R}$,

$$E^{Y}[\lambda_1 g_1(X,Y) + \lambda_2 g_2(X,Y)] = \lambda_1 E^{Y}[g_1(X,Y)] + \lambda_2 E^{Y}[g_2(X,Y)]$$

whenever the conditional expectations thereof are well defined and do not produce
$\infty - \infty$ forms. Monotonicity is equally obvious: if $g_1(x,y) \le g_2(x,y)$, then

$$E^{Y}[g_1(X,Y)] \le E^{Y}[g_2(X,Y)].$$
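Theorem 2.4.5 If $g$ is non-negative or such that $E[|g(X,Y)|] < \infty$, we have

$$E[E^{Y}[g(X,Y)]] = E[g(X,Y)].$$

Proof. We have

$$\begin{aligned}
E[E^{Y}[g(X,Y)]] = E[\psi(Y)] &= \sum_{y \in G} \psi(y) P(Y = y) \\
&= \sum_{y \in G} \sum_{x \in F} g(x,y) P(X = x \mid Y = y) P(Y = y) \\
&= \sum_{y \in G} \sum_{x \in F} g(x,y) P(X = x, Y = y) \\
&= E[g(X,Y)].
\end{aligned}$$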


Theorem 2.4.6 If $w$ is non-negative or such that $E[|w(Y)|] < \infty$,

$$E^{Y}[w(Y)] = w(Y), \qquad (2.25)$$

and more generally,

$$E^{Y}[w(Y) h(X,Y)] = w(Y) E^{Y}[h(X,Y)], \qquad (2.26)$$

assuming that the left-hand side of (2.26) is well defined.

Proof. We prove (2.26) ((2.25) follows by setting $h(x,y) = 1$). We consider only
the case where $w$ and $h$ are non-negative, since the general case follows easily from
this special case. We have,

$$\begin{aligned}
E^{Y=y}[w(Y) h(X,Y)] &= \sum_{x \in F} w(y) h(x,y) P(X = x \mid Y = y) \\
&= w(y) \sum_{x \in F} h(x,y) P(X = x \mid Y = y) \\
&= w(y) E^{Y=y}[h(X,Y)].
\end{aligned}$$
Theorem 2.4.7 If $X$ and $Y$ are independent and if $v$ is non-negative or such that
$E[|v(X)|] < \infty$, then

$$E^{Y}[v(X)] = E[v(X)].$$

Proof. We have, by independence,

$$E^{Y=y}[v(X)] = \sum_{x \in F} v(x) P(X = x \mid Y = y) = \sum_{x \in F} v(x) P(X = x) = E[v(X)].$$


Theorem 2.4.8 If $X$ and $Y$ are independent and if $g : F \times G \to \mathbb{R}$ is non-negative
or such that $E[|g(X,Y)|] < \infty$, then, for all $y \in G$,

$$E[g(X,Y) \mid Y = y] = E[g(X,y)].$$

Proof. Applying formula (2.24) with $P(X = x \mid Y = y) = P(X = x)$ (by
independence), we obtain

$$\psi(y) = \sum_{x \in F} g(x, y) P(X = x) = E[g(X,y)].$$


We now give the successive conditioning rule. Suppose that $Y = (Y_1, Y_2)$, where
$Y_1$ and $Y_2$ are discrete random variables. In this situation, we use the more developed notation

$$E^{Y}[g(X,Y)] = E^{Y_1,Y_2}[g(X,Y_1,Y_2)].$$
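Theorem 2.4.9 Suppose that $Y = (Y_1, Y_2)$ as above. If $g$ is non-negative or such
that $E[|g(X,Y)|] < \infty$, then

$$E^{Y_2}[E^{Y_1,Y_2}[g(X,Y_1,Y_2)]] = E^{Y_2}[g(X,Y_1,Y_2)]. \qquad (2.27)$$

Proof. Let

$$\psi(Y_1,Y_2) = E^{Y_1,Y_2}[g(X,Y_1,Y_2)].$$

We have to show that

$$E^{Y_2}[\psi(Y_1,Y_2)] = E^{Y_2}[g(X,Y_1,Y_2)].$$

Here

$$\psi(y_1,y_2) = \sum_{x} g(x,y_1,y_2) P(X = x \mid Y_1 = y_1, Y_2 = y_2)$$

and

$$E^{Y_2=y_2}[\psi(Y_1,Y_2)] = \sum_{y_1} \psi(y_1,y_2) P(Y_1 = y_1 \mid Y_2 = y_2),$$

that is,

$$E^{Y_2=y_2}[\psi(Y_1,Y_2)] = \sum_{y_1} \sum_{x} g(x,y_1,y_2) P(X = x \mid Y_1 = y_1, Y_2 = y_2) P(Y_1 = y_1 \mid Y_2 = y_2).$$

But

$$\begin{aligned}
P(X = x \mid Y_1 = y_1, Y_2 = y_2) P(Y_1 = y_1 \mid Y_2 = y_2)
&= \frac{P(X = x, Y_1 = y_1, Y_2 = y_2)}{P(Y_1 = y_1, Y_2 = y_2)} \cdot \frac{P(Y_1 = y_1, Y_2 = y_2)}{P(Y_2 = y_2)} \\
&= P(X = x, Y_1 = y_1 \mid Y_2 = y_2).
\end{aligned}$$

Therefore

$$E^{Y_2=y_2}[\psi(Y_1,Y_2)] = \sum_{y_1} \sum_{x} g(x,y_1,y_2) P(X = x, Y_1 = y_1 \mid Y_2 = y_2) = E^{Y_2=y_2}[g(X,Y_1,Y_2)].$$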
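
The successive conditioning rule (2.27) can be checked on a small example. In the sketch below, the joint distribution, the function $g$ used in the demonstration and all helper names are hypothetical choices made for the illustration; probabilities are exact rationals so that both sides can be compared for equality.

```python
from fractions import Fraction
from itertools import product


def cond_exp(joint, g, key):
    """Conditional expectation of g given key(x, y1, y2), for a finite joint pmf.

    joint maps outcomes (x, y1, y2) to probabilities. The result maps each
    value k of the conditioning variable with P(key = k) > 0 to
    sum(g * P) / P(key = k), which is formula (2.24).
    """
    num, den = {}, {}
    for omega, prob in joint.items():
        k = key(*omega)
        num[k] = num.get(k, 0) + g(*omega) * prob
        den[k] = den.get(k, 0) + prob
    return {k: num[k] / den[k] for k in den if den[k] > 0}


def example_joint():
    """A hypothetical joint pmf on {0,1,2} x {0,1} x {0,1} with positive weights."""
    weights = {(x, y1, y2): 1 + x + 2 * y1 + 3 * x * y2
               for x, y1, y2 in product(range(3), range(2), range(2))}
    total = sum(weights.values())
    return {omega: Fraction(w, total) for omega, w in weights.items()}


def check_successive_conditioning(joint, g):
    """Return both sides of (2.27) as dicts indexed by the values y2 of Y2."""
    inner = cond_exp(joint, g, key=lambda x, y1, y2: (y1, y2))  # psi(y1, y2)
    psi = lambda x, y1, y2: inner[(y1, y2)]
    lhs = cond_exp(joint, psi, key=lambda x, y1, y2: y2)
    rhs = cond_exp(joint, g, key=lambda x, y1, y2: y2)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = check_successive_conditioning(
        example_joint(), lambda x, y1, y2: x * x + y1 - 2 * y2)
    print(lhs)
    print(rhs)
```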


We shall see later that the above rules are very general. 
2.5 Exercises


Exercise 2.5.1. AN ALTERNATIVE PROOF OF POINCARE’S FORMULA 
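Let $A_1, \ldots, A_n$ be events and let $X_1, \ldots, X_n$ be their indicator functions. From
the developed expression of $E\left[\prod_{i=1}^{n} (1 - X_i)\right]$, deduce the formula:

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n).$$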


Exercise 2.5.2. NON-ESSENTIAL SET 
Let $X$ be a discrete random variable taking its values in $E$, with probability
distribution $(p(x), x \in E)$. Let $A := \{\omega;\, p(X(\omega)) = 0\}$. Show that $P(A) = 0$.


Exercise 2.5.3. THE MEAN IS THE CENTER OF INERTIA
Let $X$ be a discrete random variable taking real values, with mean $\mu$ and finite
variance $\sigma^2$. Show that, for all $a \in \mathbb{R}$, $a \ne \mu$,

$$E[(X - a)^2] > E[(X - \mu)^2] = \sigma^2.$$


Exercise 2.5.4. NULL VARIANCE 
Prove for an integer-valued random variable that a null variance implies that this 
random variable is almost surely constant. 


Exercise 2.5.5. GIBBS’S INEQUALITY 
Let $(p(x), x \in \mathcal{X})$ and $(q(x), x \in \mathcal{X})$ be two probability distributions on the finite
space $\mathcal{X}$. Prove the Gibbs inequality

$$-\sum_{x \in \mathcal{X}} p(x) \log p(x) \le -\sum_{x \in \mathcal{X}} p(x) \log q(x), \qquad (2.28)$$

with equality if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$.


Exercise 2.5.6. THE ARITHMETIC-GEOMETRIC INEQUALITY 
Let $x_i$ ($1 \le i \le n$) be positive numbers, and let $p_i$ ($1 \le i \le n$) be non-negative
numbers such that $\sum_{i=1}^{n} p_i = 1$. Prove that

$$p_1 x_1 + p_2 x_2 + \cdots + p_n x_n \ge x_1^{p_1} x_2^{p_2} \cdots x_n^{p_n}.$$
Exercise 2.5.7. THE GEOMETRIC DISTRIBUTION IS MEMORYLESS

Show that a geometric random variable $T$ with parameter $p \in (0,1)$ is memoryless
in the sense that for all integers $k, k_0 \ge 1$, $P(T = k + k_0 \mid T > k_0) = P(T = k)$.


Exercise 2.5.8. SUM OF INDEPENDENT GEOMETRIC VARIABLES 
Let $T_1$ and $T_2$ be two independent geometric random variables with the same
parameter $p \in (0,1)$. Give the probability distribution of the sum $X = T_1 + T_2$.


Exercise 2.5.9. FACTORIAL OF POISSON 


1. Let $X$ be a Poisson random variable with mean $\theta > 0$. Compute the mean of
the random variable X! (factorial, not exclamation mark). 


2. Compute EF [o*]. 
3. What is the probability that X is odd? 


Exercise 2.5.10. PROFESSOR NEBULOUS 

Professor Nebulous travels from Los Angeles to Paris with stopovers in New York 
and London. In each airport, his luggage is transferred to the departing plane. 
In each airport, with probability p, his luggage will not be placed in the right 
plane. Professor Nebulous finds that his suitcase has not reached Paris. What are 
the chances that the mishap took place in Los Angeles, New York and London 
respectively? 


Exercise 2.5.11. THE RETURN OF THE COUPON COLLECTOR 

In the coupon collector's problem of Example 2.2.6, prove that for all $c > 0$,
$P(X > \lceil n \ln n + cn \rceil) \le e^{-c}$. Hint: you might find it useful to define $A_i$ to be the
event that a Type $i$ coupon has not shown up during the first $\lceil n \ln n + cn \rceil$ tablets.


Exercise 2.5.12. MORE BERNOULLI 

Let $X_1, \ldots, X_{2n}$ be independent random variables taking the values 0 or 1, and
such that $P(X_i = 1) = p \in (0,1)$ ($1 \le i \le 2n$). Let $Z := \sum_{i=1}^{n} X_i X_{n+i}$. Compute
$P(Z = k)$ ($1 \le k \le n$).


Exercise 2.5.13. STOCHASTICALLY LARGER 

Let $X$ and $Y$ be two integer-valued random variables. Then $X$ is said to be
stochastically larger than $Y$ if for all $n \ge 0$, $P(X > n) \ge P(Y > n)$. Show
that in this case $E[u(X)] \ge E[u(Y)]$ whenever $u : \mathbb{N} \to \mathbb{R}$ is a non-negative
non-decreasing function.
Exercise 2.5.14. THE MATCHBOX

A smoker has a matchbox containing N matches in each pocket. He reaches at 
random for one box or the other. What is the probability that, having eventually 
found an empty matchbox, there will be k matches left in the other box? 


Exercise 2.5.15. THE ENTOMOLOGIST 
Each individual of a specific breed of insects has, independently of the others, the
probability $\theta$ of being a male.

(a) An entomologist seeks to collect exactly $M \ge 1$ males, and therefore stops
hunting as soon as $M$ males are captured. What is the distribution of $X$, the
number of insects that must be caught in order to collect exactly $M$ males?

(b) What is the distribution of $X$, the smallest number of insects that the
entomologist must catch to collect at least $M$ males and $N$ females?


Exercise 2.5.16. THE ENTOMOLOGIST STRIKES AGAIN! 

Recall the setting of Exercise 2.5.15. Each individual of a specific breed of insects
has, independently of the others, the probability $\theta$ of being a male. An
entomologist seeks to collect exactly $M \ge 1$ males, and therefore stops hunting as soon
as she captures $M$ males. She has to capture an insect in order to determine its
gender. What is the expectation of $X$, the number of insects she must catch to
collect exactly $M$ males? (In Exercise 2.5.15, you computed the distribution of
$X$, from which you can of course compute the mean. However you can find the
solution more quickly, and this is what is required in the present exercise.)


Exercise 2.5.17. THE BLUE PINKO 

The blue pinko, an extravagant and yet unregistered bird, lays $T$ eggs, each egg
blue or pink, with probability $p$ for each given egg to be blue. The colors of
the successive eggs are independent, and independent of the number of eggs laid.
Example 2.3.13 showed that if the number of eggs is Poisson with mean $\theta$, then
the number of blue eggs is Poisson with mean $\theta p$ and the number of pink eggs is
Poisson with mean $\theta q$. Show that the number of blue eggs and the number of pink
eggs are independent random variables.


Exercise 2.5.18. WALD’S EXPECTATION FORMULA 

Let $\{Y_n\}_{n \ge 1}$ be a sequence of integer-valued integrable random variables such that
$E[Y_n] = E[Y_1]$ for all $n \ge 1$. Let $T$ be an integer-valued random variable such that
for all $n \ge 1$, the event $\{T \ge n\}$ is independent of $Y_n$. Let $X := \sum_{n=1}^{T} Y_n$. Prove
that

$$E[X] = E[Y_1] E[T].$$
Exercise 2.5.19. FAKE SYMMETRY!
Let $\{X_n\}_{n \ge 1}$ be an independent sequence of $\{H, T\}$-valued (H = heads, T = tails)
random variables such that

$$P(X_n = H) = \frac{1}{2} \qquad (n \ge 1).$$

Suppose “heads” first appears at the $n$-th toss. Is it true that the probability that
$n$ is even is equal to the probability that $n$ is odd, and therefore equal to $\frac{1}{2}$, “by
symmetry”?


Exercise 2.5.20. $\alpha g_1 + (1 - \alpha) g_2$

Show that if $g_1$ and $g_2$ are the generating functions of some integer-valued random
variables, then $\alpha g_1 + (1 - \alpha) g_2$ is also the generating function of an integer-valued
random variable. Which one?


Exercise 2.5.21. MEAN AND VARIANCE VIA GENERATING FUNCTIONS 


(a) Compute the mean and variance of the binomial random variable of size $n$ and
parameter $p$ from its generating function. Do the same for the Poisson random
variable of mean $\theta$.

(b) What is the generating function of the geometric random variable $T$ with
parameter $p \in (0,1)$? Compute its first two derivatives and deduce from the result
the variance of $T$.

(c) What is the $n$-th factorial moment ($E[X(X - 1) \cdots (X - n + 1)]$) of a Poisson
random variable $X$ of mean $\theta > 0$?


Exercise 2.5.22. FROM THE GENERATING FUNCTION TO THE DISTRIBUTION 
What is the probability distribution of the integer-valued random variable X with 
generating function g(z) = ay (\z| < 2)? 


Exercise 2.5.23. THROW A DIE 

You perform three independent tosses of an unbiased die. What is the probability 
that one of these tosses results in a number that is the sum of the two other 
numbers? (You are required to find a solution using generating functions.) 


Exercise 2.5.24. RESIDUAL TIME 

Let $X$ be a random variable with values in $\mathbb{N}$ and with finite mean $m$. Show that
$p_n := \frac{1}{m} P(X > n)$ ($n \in \mathbb{N}$) defines a probability distribution on $\mathbb{N}$. Compute its
generating function $G$ in terms of the generating function $g$ and the mean $m$ of $X$.
Exercise 2.5.25. A RECURRENCE EQUATION, TAKE 1
Recall the notation $a^+ = \max(a, 0)$. Consider the recurrence equation,

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1} \qquad (n \ge 0),$$

where $X_0$ is a random variable taking its values in $\mathbb{N}$, and $\{Z_n\}_{n \ge 1}$ is a sequence
of independent random variables taking their values in $\mathbb{N}$, and independent of $X_0$.
Express the generating function $\psi_{n+1}$ of $X_{n+1}$ in terms of the generating function
$\varphi$ of $Z_1$.


Exercise 2.5.26. POISSON AND MULTINOMIAL 

Suppose we have $N$ bins in which we place balls in such a manner that the number
of balls in any given bin is a Poisson variable of mean $\frac{m}{N}$ and is independent of
numbers in the other bins. In particular, the total number of balls $Y_1 + \cdots + Y_N$ is,
as the sum of independent Poisson random variables, a Poisson random variable
whose mean is the sum of the means, that is $m$.

For a given arbitrary integer $k$, compute the conditional probability that there are
$k_1$ balls in bin 1, $k_2$ balls in bin 2, etc., given that the total number of balls is
$k_1 + \cdots + k_N = k$.


Exercise 2.5.27. CONDITIONED POISSON 

Let $X_1$ and $X_2$ be two independent Poisson random variables with respective
means $\theta_1 > 0$ and $\theta_2 > 0$. Compute $E^{X_1+X_2}[X_1]$, that is $E^{Y}[X]$, where $X = X_1$,
$Y = X_1 + X_2$.


Exercise 2.5.28. MULTINOMIAL DISTRIBUTION AND CONDITIONING 
Let $(X_1, \ldots, X_k)$ be a multinomial random vector with size $n$ and parameters
$p_1, \ldots, p_k$. Compute $E^{X_1}[X_2]$.


Exercise 2.5.29. SEVERAL ANCESTORS 
Give the survival probability of the branching process of Section 2.3 (subsection 
Branching Trees, page 58) with k ancestors, k > 1. 


Exercise 2.5.30. VARIANCE OF THE BRANCHING PROCESS 

Give the mean and variance of the size X,, of the branching process of Section 
2.3 (subsection Branching Trees, page 58) with one ancestor, and then with k 
ancestors. 


Exercise 2.5.31. BRANCHING WITH IMMIGRATION 
The branching model with one ancestor is modified as follows. The $n$-th generation
($n \ge 1$) is augmented by a random number of immigrants $I_n$. The sequence $\{I_n\}_{n \ge 1}$
is IID with common generating function $g_I$, and each $I_n$ is independent of the state
of the population before ($<$) time $n$. The immigrants and the other members of
the population are indistinguishable. Show that the generating function $\psi_n$ of the
number $X_n$ of members of the total population satisfies the recurrence equation

$$\psi_{n+1}(z) = \psi_n(g(z))\, g_I(z),$$


where g is the common generating function corresponding to the progeny of the 
members of the population (indigenous or immigrants). 


Exercise 2.5.32. EXTINCTION TIME 

Consider the branching process with a single ancestor and typical progeny
geometrically distributed ($P(Z = k) = q p^k$ ($k \ge 0$), $p \in (0,1)$, $q = 1 - p$). Find the
distribution of the extinction time $T := \inf\{n;\, X_n = 0\}$. For what values of $p$ is
$E[T] < \infty$?


Exercise 2.5.33. CONDITIONAL INDEPENDENCE OF TWO VARIABLES GIVEN AN 
EVENT 
Let $A$ be some event of positive probability, and let $P_A$ denote the probability $P$
conditioned by $A$, that is,

$$P_A(\cdot) = P(\cdot \mid A).$$

The random variables $X$ and $Y$ are said to be conditionally independent given $A$
if they are independent with respect to probability $P_A$. Prove that this is the case
if and only if for all $u, v \in \mathbb{R}$,

$$P(A)\, E[e^{iuX} e^{ivY} 1_A] = E[e^{iuX} 1_A]\, E[e^{ivY} 1_A].$$




Chapter 3 


Continuous Random Vectors 


Having studied discrete random variables, that is, random variables taking their 
values in a finite or countable set, we now introduce random variables taking real 
(possibly infinite) values, and random vectors with a probability density (the so- 
called “continuous” random vectors). 


3.1 Random Variables with Real Values 


We start with the definitions of a random variable and of its cumulative distribution
function. Recall the notation $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$.

Definition 3.1.1 A random variable is a function $X : \Omega \to \overline{\mathbb{R}}$ such that for all
$a \in \mathbb{R}$,

$$\{X \le a\} \in \mathcal{F}.$$

This is a minimal requirement if one wants to assign a probability to $\{X \le a\}$.


If X does not take infinite values, we say more precisely that X is a real random 
variable. 


EXAMPLE 3.1.2: RANDOM POINT ON THE SQUARE, TAKE 2. (Example 1.2.4
ct'd) Here $\omega = (x,y)$, where $x, y \in [0,1]$. Define the coordinate functions of $\Omega$, $X$
and $Y$, by

$$X(\omega) = x, \quad Y(\omega) = y.$$

Since, for $a \in [0,1]$, $\{\omega;\, X(\omega) \le a\} = [0,a] \times [0,1]$ is a set for which the area can be defined, $X$
is a random variable. So is $Y$ for similar reasons.
Definition 3.1.3 From the probabilistic point of view, a random variable $X$ is
described by its cumulative distribution function (for short: CDF)

$$F(x) = P(X \le x). \qquad (3.1)$$

In particular, for all $a, b \in \mathbb{R}$ such that $a \le b$,

$$P(a < X \le b) = F(b) - F(a)$$

(watch the inequality signs). Indeed, $\{a < X \le b\} + \{X \le a\} = \{X \le b\}$, and
therefore $P(a < X \le b) + P(X \le a) = P(X \le b)$, from which the announced
identity follows.


Theorem 3.1.4 The cumulative distribution function $F$ has the following
properties:

(i) $F : \mathbb{R} \to [0,1]$.
(ii) $F$ is non-decreasing.
(iii) $F$ is right-continuous.
(iv) For each $x \in \mathbb{R}$ there exists $F(x-) := \lim_{h \downarrow 0} F(x - h)$.
(v) $F(+\infty) := \lim_{a \uparrow \infty} F(a) = P(X < \infty) = 1 - P(X = +\infty)$.
(vi) $F(-\infty) := \lim_{a \downarrow -\infty} F(a) = P(X = -\infty)$.
(vii) $P(X = a) = F(a) - F(a-)$ for all $a \in \mathbb{R}$.

Proof. (i) is obvious; (ii) If $a \le b$, then $\{X \le a\} \subseteq \{X \le b\}$, and therefore
$P(X \le a) \le P(X \le b)$; (iii) Let $B_n = \{X \le a + \frac{1}{n}\}$. Since $\bigcap_{n \ge 1} \{X \le a + \frac{1}{n}\} =
\{X \le a\}$ (see Exercise 4.5.2), we have, by sequential continuity,

$$\lim_{n \uparrow \infty} P\left(X \le a + \frac{1}{n}\right) = P(X \le a).$$

(iv) We know from Analysis that a non-decreasing function from $\mathbb{R}$ to $\mathbb{R}$ has
at any point a limit to the left; (v) Let $A_n = \{X \le n\}$ and observe that
$\bigcup_{n=1}^{\infty} \{X \le n\} = \{X < \infty\}$. The result again follows by sequential continuity;
(vi) Apply (1.6) with $B_n = \{X \le -n\}$ and observe that $\bigcap_{n=1}^{\infty} \{X \le -n\} =
\{X = -\infty\}$. The result follows by sequential continuity. (vii) The sequence
$B_n = \{a - \frac{1}{n} < X \le a\}$ is decreasing, and $\bigcap_{n=1}^{\infty} B_n = \{X = a\}$. Therefore, by
sequential continuity,

$$P(X = a) = \lim_{n \uparrow \infty} P\left(a - \frac{1}{n} < X \le a\right) = \lim_{n \uparrow \infty} \left(F(a) - F\left(a - \frac{1}{n}\right)\right),$$

that is to say, $P(X = a) = F(a) - F(a-)$.


Remember in particular that 


$$P(X = -\infty) = F(-\infty) \quad \text{and} \quad P(X = +\infty) = 1 - F(+\infty).$$

From (vii), we see that the CDF is continuous at $a \in \mathbb{R}$ if and only if $P(X = a) = 0$.
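
For a random variable with finitely many values, property (vii) can be seen concretely. The sketch below (a hypothetical distribution and helper names chosen for the illustration) builds $F$ and its left limit $F(\cdot-)$ from a probability mass function and recovers the mass at a point as the size of the jump.

```python
from fractions import Fraction


def make_cdf(pmf):
    """Return (F, F_left) with F(x) = P(X <= x) and F_left(x) = P(X < x) = F(x-).

    pmf is a finite dict {value: probability}.
    """
    def F(x):
        return sum(p for v, p in pmf.items() if v <= x)

    def F_left(x):
        return sum(p for v, p in pmf.items() if v < x)

    return F, F_left


if __name__ == "__main__":
    # Hypothetical distribution with atoms at 0, 1 and 3.
    pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}
    F, F_left = make_cdf(pmf)
    for a in (0, 1, 2, 3):
        print(a, F(a), F(a) - F_left(a))  # the jump at a equals P(X = a)
```

At $a = 2$, which carries no mass, the jump vanishes and $F$ is continuous there.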


Being a non-decreasing right-continuous function, $F$ has at most a countable
set of discontinuity points on $\mathbb{R}$, say $\{d_n, n \in D\}$, where $D \subseteq \mathbb{N}$. Define the
discontinuous part $F_d$ of $F$ by

$$\begin{aligned}
F_d(x) &:= F(-\infty) + \sum_{n \in D} (F(d_n) - F(d_n-))\, 1_{\{d_n \le x\}} + (1 - F(+\infty))\, 1_{\{+\infty\}}(x) \\
&= P(X = -\infty) + \sum_{n \in D} P(X = d_n)\, 1_{\{d_n \le x\}} + P(X = +\infty)\, 1_{\{+\infty\}}(x).
\end{aligned}$$

In particular, when a random variable takes its values in a denumerable subset
(to which one must possibly add $-\infty$ and $+\infty$), its CDF reduces to the discontinuous
part $F_d$, and the sequence $p(d_n) = P(X = d_n)$ ($n \in D$) together with the values
$F(-\infty)$ and $F(+\infty)$ suffice to describe the probabilistic behavior of $X$.


An important special case is when $X$ is a real random variable and

$$F(x) = \int_{-\infty}^{x} f(y)\, dy \qquad (3.2)$$

for some function $f \ge 0$ called the probability density function of $X$. The random
variable and its CDF are then called (absolutely) continuous.

Note that, if $X$ is real (no infinite values),

$$\int_{-\infty}^{+\infty} f(y)\, dy = 1.$$

Definition 3.1.5 Two random variables $X$ and $Y$ are called independent if for
all $a \in \mathbb{R}$, $b \in \mathbb{R}$,

$$P(X \le a, Y \le b) = P(X \le a) P(Y \le b). \qquad (3.3)$$


EXAMPLE 3.1.6: RANDOM POINT IN THE SQUARE, TAKE 3. Recall the model:
$\Omega = [0,1]^2$, $P(A)$ = area of $A$. Let $X$ and $Y$ be the coordinate random variables
defined as follows: $\omega = (x,y)$, $X(\omega) = x$ and $Y(\omega) = y$. We are going to prove
that these random variables are independent. Indeed,

$$\{(x,y) \in [0,1]^2;\ x \le a,\ y \le b\} = \{X \le a\} \cap \{Y \le b\},$$

and therefore (with $0 \le a, b \le 1$)

$$P(\{X \le a\} \cap \{Y \le b\}) = a \times b = P(X \le a) P(Y \le b).$$


Expectation 


For a function $g : \mathbb{R} \to \mathbb{R}$, the symbol

$$\int_{-\infty}^{+\infty} g(x)\, dF(x) \qquad (3.4)$$

denotes the Stieltjes-Lebesgue integral of the function $g$ with respect to $F$.


The precise definition of this integral will be given in Chapter 4. For practical 
purposes, it suffices to mention that this integral is well defined for a large class 
of functions g, comprising the non-negative “measurable functions”. The class of 
measurable functions is extremely large and one can say that for practical purposes, 
“all functions are measurable”. The reader who does not feel comfortable with 
this provisional lack of precision is referred to Section 4.1 where these objects 
and the Stieltjes-Lebesgue integral are rigorously defined and where all the formal 
manipulations performed in the chapters preceding Chapter 4 will be shown to be 
licit. In that chapter, it will also be shown that a measurable function of a random
variable is in turn a random variable, a result that we shall use a number of times.


In the special case of a real random variable for which the continuous component
$F_c$ of the CDF is absolutely continuous, that is,

$$F_c(x) = \int_{-\infty}^{x} f_c(y)\, dy, \qquad (3.5)$$

the integral in Eqn. (3.4) is

$$\sum_{n \in D} g(d_n) (F(d_n) - F(d_n-)) + \int_{-\infty}^{+\infty} g(x) f_c(x)\, dx.$$

The most frequent cases arising are the purely discontinuous case where $F(t) =
F_d(t)$, for which (in the case where $X$ can take infinite values and when the values
$g(-\infty)$ and $g(+\infty)$ are defined)

$$\int g(x)\, dF(x) = F(-\infty) g(-\infty) + \sum_{n \in D} g(d_n) \{F(d_n) - F(d_n-)\} + (1 - F(+\infty)) g(+\infty),$$

and the absolutely continuous case, for which

$$\int_{-\infty}^{+\infty} g(x)\, dF(x) = \int_{-\infty}^{+\infty} g(x) f(x)\, dx. \qquad (3.6)$$
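
To make the mixed formula concrete, here is a minimal numerical sketch. The example distribution is a hypothetical one chosen for the illustration: $X = \min(U, 1/2)$ with $U$ uniform on $[0,1]$, which has one atom at $1/2$ of mass $1/2$ and density $f_c = 1$ on $(0, 1/2)$. The integral is approximated by a midpoint rule, and the function name and step count are assumptions.

```python
def mixed_expectation(g, atoms, density, a, b, n=10_000):
    """Approximate sum_n g(d_n) P(X = d_n) + integral_a^b g(x) f_c(x) dx.

    atoms is a dict {d_n: P(X = d_n)}; density is f_c, assumed to vanish
    outside [a, b]; the integral uses the midpoint rule with n cells.
    """
    discrete = sum(g(d) * p for d, p in atoms.items())
    h = (b - a) / n
    continuous = h * sum(
        g(a + (i + 0.5) * h) * density(a + (i + 0.5) * h) for i in range(n)
    )
    return discrete + continuous


if __name__ == "__main__":
    atoms = {0.5: 0.5}          # P(X = 1/2) = P(U >= 1/2)
    density = lambda x: 1.0     # f_c on (0, 1/2)
    # E[X] = 1/8 + 1/4 and E[X^2] = 1/24 + 1/8 for this example.
    print(mixed_expectation(lambda x: x, atoms, density, 0.0, 0.5))
    print(mixed_expectation(lambda x: x * x, atoms, density, 0.0, 0.5))
```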


Definition 3.1.7 Let $X$ be a random variable with the cumulative distribution
function $F$ and let the function $g : \mathbb{R} \to \mathbb{R}$ be either non-negative, or such that
$\int_{-\infty}^{+\infty} |g(x)|\, dF(x) < \infty$ (one then says that $g(X)$ is integrable). The expectation
of $g(X)$ is the quantity

$$E[g(X)] := \int_{-\infty}^{+\infty} g(x)\, dF(x). \qquad (3.7)$$

For a complex function $g = g_R + i g_I : \mathbb{R} \to \mathbb{C}$, we let

$$E[g(X)] = E[g_R(X)] + i E[g_I(X)]$$

as long as $E[g_R(X)]$ and $E[g_I(X)]$ are finite quantities.


Theorem 3.1.8 Let $g, g_1, g_2$ be (measurable) functions from $\mathbb{R}$ to $\mathbb{R}$. We have
(linearity)

$$E[\lambda_1 g_1(X) + \lambda_2 g_2(X)] = \lambda_1 E[g_1(X)] + \lambda_2 E[g_2(X)], \qquad (3.8)$$

whenever
(a) $\lambda_1, \lambda_2 \in \mathbb{R}_+$ and $g_1$ and $g_2$ are non-negative, or
(b) $\lambda_1, \lambda_2 \in \mathbb{R}$, and $g_1$ and $g_2$ satisfy the integrability condition of Definition 3.1.7.

Also (monotonicity),

$$E[g_1(X)] \le E[g_2(X)], \qquad (3.9)$$

whenever both sides are well defined and $g_1 \le g_2$.

Also (triangle inequality),

$$|E[g(X)]| \le E[|g(X)|]. \qquad (3.10)$$


Proof. These properties follow from the corresponding properties of the Stieltjes— 
Lebesgue integral and will be admitted for the time being until Chapter 4. 
Mean and Variance


Definition 3.1.9 Let $X$ be a real random variable such that $E[|X|] < \infty$. Then
$X$ is said to be integrable, and in this case (only in this case) we define the mean
of $X$ as the (finite) number

$$m := E[X].$$

From the inequality $|a| \le 1 + a^2$, true for all $a \in \mathbb{R}$, we have that $|X| \le 1 + X^2$,
and therefore, by the monotonicity and linearity properties, $E[|X|] \le 1 + E[X^2]$
(we also used the fact that $E[1] = 1$). Therefore if $E[X^2] < \infty$ (in which case
we say that $X$ is square-integrable), then $X$ is integrable. The following definition
then makes sense.


Definition 3.1.10 Let X be a square-integrable random variable. The variance 
a” of X is the quantity 
o? := E[(X —m)?]. 


The variance is also denoted by $\operatorname{Var}(X)$. From the linearity of expectation, it follows that $E[(X - m)^2] = E[X^2] - 2mE[X] + m^2$, that is,

$$\operatorname{Var}(X) = E[X^2] - m^2. \tag{3.11}$$

In a sense, "the mean is the center of inertia of $X$". By this imprecise remark, the following is meant:

Theorem 3.1.11 For every square-integrable random variable $X$ with mean $m$,

$$E[(X - c)^2] \ge E[(X - m)^2] \quad \text{for all } c \in \mathbb{R}.$$

Proof.

$$E[(X - c)^2] = E[(X - m)^2] + 2(m - c)\,E[X - m] + (m - c)^2 = E[(X - m)^2] + 0 + (m - c)^2 \ge E[(X - m)^2].$$


Theorem 3.1.12 Let $Z$ be a non-negative real random variable and let $a$ be a positive number. Then (Markov's inequality):

$$P(Z \ge a) \le \frac{E[Z]}{a}.$$


The proof is the same as in the discrete case (Theorem 2.1.23). 

Taking $Z = (X - m)^2$ and $a = \varepsilon^2$ in Markov's inequality, we obtain Chebyshev's inequality:

$$P(|X - m| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}.$$


Again, the proof is the same as the one in the discrete case. 
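As a numerical illustration of the two inequalities, the following sketch compares the exact probabilities with the Markov and Chebyshev bounds. It assumes an exponential variable with parameter `lam` (defined later in this section), for which the tail probability, mean 1/lam and variance 1/lam^2 are available in closed form; the function names are illustrative only.

```python
import math

def exp_cdf(lam, x):
    """CDF of an exponential variable with parameter lam (assumed example law)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def deviation_probability(lam, eps):
    """Exact P(|X - m| >= eps) for X exponential with mean m = 1/lam."""
    m = 1.0 / lam
    upper = 1.0 - exp_cdf(lam, m + eps)  # P(X >= m + eps)
    lower = exp_cdf(lam, m - eps)        # P(X <= m - eps)
    return upper + lower

def chebyshev_bound(lam, eps):
    """sigma^2 / eps^2 with sigma^2 = 1 / lam^2."""
    return 1.0 / (lam * eps) ** 2

def markov_bound(lam, a):
    """E[X] / a with E[X] = 1 / lam."""
    return (1.0 / lam) / a

for eps in (0.5, 1.0, 2.0, 3.0):
    print(eps, deviation_probability(1.0, eps), chebyshev_bound(1.0, eps))
```

The bounds are valid but usually far from tight, which is the price paid for their generality.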
The following analogue of Theorem 2.5.3 has the same proof. 


Theorem 3.1.13 Let $X$ be a real random variable with mean $m$ and variance $\sigma^2$. Then, for all $a \in \mathbb{R}$,

$$\sigma^2 \le E[(X - a)^2].$$


The following analogue of Theorem 2.1.25 has the same proof. 


Theorem 3.1.14 Let $I$ be as above and let $\varphi : I \to \mathbb{R}$ be a convex function. Let $X$ be an integrable real-valued random variable such that $P(X \in I) = 1$. Assume moreover that either $\varphi$ is non-negative, or that $\varphi(X)$ is integrable. Then (Jensen's inequality)

$$E[\varphi(X)] \ge \varphi(E[X]).$$

EXAMPLE 3.1.15: EXAMPLES. Let $X$ be integrable. Then $E[X^2] \ge E[X]^2$ and $E[e^X] \ge e^{E[X]}$.


Remarkable Continuous Random Variables 


Definition 3.1.16 Let $a$ and $b$ be real numbers. A real random variable $X$ with probability density function

$$f(x) = \frac{1}{b-a}\, 1_{[a,b]}(x) \tag{3.12}$$

is called a uniform random variable on $[a,b]$. This is denoted by $X \sim \mathcal{U}([a,b])$.

Theorem 3.1.17 The mean and the variance of a uniform random variable on $[a,b]$ are given by

$$E[X] = \frac{a+b}{2}, \qquad \operatorname{Var}(X) = \frac{(b-a)^2}{12}. \tag{3.13}$$


Proof. Direct computation. 

Definition 3.1.18 A real random variable $X$ with probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\frac{(x-m)^2}{\sigma^2}}, \tag{3.14}$$

where $m \in \mathbb{R}$ and $\sigma > 0$, is called a Gaussian random variable with mean $m$ and variance $\sigma^2$. This is denoted by $X \sim \mathcal{N}(m, \sigma^2)$.

One can check that $E[X] = m$ and $\operatorname{Var}(X) = \sigma^2$ (Exercise 3.6.13).

Definition 3.1.19 A random variable $X$ with probability density function

$$f(x) = \lambda e^{-\lambda x}\, 1_{\{x \ge 0\}} \tag{3.15}$$

for some $\lambda > 0$ is called an exponential random variable with parameter $\lambda$. This is denoted by $X \sim \mathcal{E}(\lambda)$.

The CDF of the exponential random variable is

$$F(x) = (1 - e^{-\lambda x})\, 1_{\{x \ge 0\}}.$$

Theorem 3.1.20 The mean of an exponential random variable with parameter $\lambda$ is

$$E[X] = \lambda^{-1}. \tag{3.16}$$


Proof. Direct computation. Or, see the Gamma distribution below. 


The exponential distribution is memoryless in the following sense: 


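Theorem 3.1.21 Let $X \sim \mathcal{E}(\lambda)$. For all $t, t_0 \in \mathbb{R}_+$, we have

$$P(X \ge t_0 + t \mid X \ge t_0) = P(X \ge t).$$

Proof.

$$P(X \ge t_0 + t \mid X \ge t_0) = \frac{P(X \ge t_0 + t,\; X \ge t_0)}{P(X \ge t_0)} = \frac{P(X \ge t_0 + t)}{P(X \ge t_0)} = \frac{e^{-\lambda(t_0 + t)}}{e^{-\lambda t_0}} = e^{-\lambda t} = P(X \ge t).$$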
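
In preparation for the next definition, recall the definition of the gamma function $\Gamma$:

$$\Gamma(\alpha) := \int_0^\infty u^{\alpha-1} e^{-u}\,du.$$

Integration by parts yields, for $\alpha > 0$,

$$0 = \bigl[u^{\alpha} e^{-u}\bigr]_0^\infty = \int_0^\infty \alpha u^{\alpha-1} e^{-u}\,du - \int_0^\infty u^{\alpha} e^{-u}\,du = \alpha\Gamma(\alpha) - \Gamma(\alpha+1).$$

Therefore

$$\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha),$$

from which it follows in particular, since $\Gamma(1) = \int_0^\infty e^{-x}\,dx = 1$, that for all integers $n \ge 1$,

$$\Gamma(n) = (n-1)!$$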
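The recursion and the factorial identity can be checked numerically. The sketch below approximates the defining integral by a midpoint rule (the truncation point `upper` and the step count are arbitrary choices made for this illustration) and compares it with `math.gamma` from the Python standard library.

```python
import math

def gamma_numeric(alpha, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of int_0^upper u^(alpha-1) e^(-u) du."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += u ** (alpha - 1) * math.exp(-u)
    return total * h

alpha = 2.5
print(gamma_numeric(alpha), math.gamma(alpha))           # integral vs library value
print(math.gamma(alpha + 1), alpha * math.gamma(alpha))  # Gamma(a+1) = a Gamma(a)
print(math.gamma(5), math.factorial(4))                  # Gamma(n) = (n-1)!
```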


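Definition 3.1.22 Let $\alpha$ and $\beta$ be two positive real numbers. A non-negative random variable $X$ with the probability density function

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}\, 1_{\{x > 0\}} \tag{3.17}$$

is called a Gamma random variable of parameters $\alpha$ and $\beta$. This is denoted by $X \sim \gamma(\alpha, \beta)$.

We must check that (3.17) defines a probability density of a real random variable (that is, the integral of $f$ is 1).

Proof. Indeed:

$$\int_0^\infty f(x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^\infty x^{\alpha-1} e^{-\beta x}\,dx = \frac{1}{\Gamma(\alpha)} \int_0^\infty y^{\alpha-1} e^{-y}\,dy = 1,$$

where the second equality has been obtained with the change of variable $y = \beta x$.

Theorem 3.1.23 If $X \sim \gamma(\alpha, \beta)$, then

$$E[X] = \frac{\alpha}{\beta} \quad \text{and} \quad \operatorname{Var}(X) = \frac{\alpha}{\beta^2}. \tag{3.18}$$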
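
Proof.

$$E[X] = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^\infty x \cdot x^{\alpha-1} e^{-\beta x}\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^\infty x^{\alpha} e^{-\beta x}\,dx = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\,\frac{1}{\beta} = \frac{\alpha}{\beta}.$$

Similarly,

$$E[X^2] = \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}\,\frac{1}{\beta^2} = \frac{\alpha(\alpha+1)}{\beta^2}.$$

Therefore

$$\operatorname{Var}(X) = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\beta^2} - \frac{\alpha^2}{\beta^2} = \frac{\alpha}{\beta^2}.$$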
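The moments in (3.18) can also be confirmed by direct numerical integration of the density (3.17). This is only a sketch: the parameter values, the truncation of the integral at `upper` and the midpoint rule are choices made for the illustration.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Density (3.17) of the gamma(alpha, beta) distribution."""
    if x <= 0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

def moment(k, alpha, beta, upper=40.0, steps=100_000):
    """Midpoint approximation of int_0^upper x^k f(x) dx."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        total += x ** k * gamma_pdf(x, alpha, beta)
    return total * h

alpha, beta = 3.0, 2.0
mean = moment(1, alpha, beta)
variance = moment(2, alpha, beta) - mean ** 2
print(moment(0, alpha, beta))     # total mass, close to 1
print(mean, alpha / beta)         # both close to 1.5
print(variance, alpha / beta**2)  # both close to 0.75
```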


The exponential distribution is a particular case of the Gamma distribution. In fact, $\gamma(1, \lambda) = \mathcal{E}(\lambda)$.

Definition 3.1.24 A chi-square random variable with $n$ degrees of freedom is, by definition, a random variable $X$ with the $\gamma(\frac{n}{2}, \frac{1}{2})$ distribution. This is denoted by $X \sim \chi^2_n$.

Its probability density function is therefore

$$f(x) = \frac{1}{2^{n/2}\,\Gamma(\frac{n}{2})}\, x^{\frac{n}{2}-1} e^{-\frac{1}{2}x}\, 1_{\{x > 0\}}. \tag{3.19}$$


This distribution plays an important role in Statistics. 


Definition 3.1.25 A random variable $X$ with probability density function

$$f(x) = \frac{1}{\pi(1+x^2)} \tag{3.20}$$

is called a Cauchy random variable.

It is important to observe that the mean of $X$ is not defined since

$$\int_{\mathbb{R}} \frac{|x|}{\pi(1+x^2)}\,dx = +\infty.$$

Of course, a fortiori, $X$ does not have a variance.


Characteristic Functions 


The notion of characteristic function brings Fourier analysis into the picture of 
probability theory. It provides a technique for manipulating probability distri- 
butions, just as the generating function does for integer-valued random variables, 
only this time for real random variables. It is also a fundamental tool for the study 
of convergence in distribution of a sequence of random variables, a notion that will 
be introduced in Chapter 7. 


Definition 3.1.26 The characteristic function (for short: CF) of a real random variable $X$ is the function $\psi : \mathbb{R} \to \mathbb{C}$ given by

$$\psi(u) := E[e^{iuX}]. \tag{3.21}$$

Alternatively,

$$\psi(u) = \int_{\mathbb{R}} e^{iux}\,dF(x),$$

where $F$ is the cumulative distribution function of $X$. In particular, if $X$ is an absolutely continuous random variable with probability density function $f$,

$$\psi(u) = \int_{\mathbb{R}} e^{iux} f(x)\,dx,$$

that is, $\psi$ is the Fourier transform of $f$.

In the case of integer-valued random variables the generating function $g$ and the characteristic function $\psi$ of such a variable $X$ are linked by $\psi(u) = g(e^{iu})$.


EXAMPLE 3.1.27: UNIFORM. Let $X$ be a random variable uniformly distributed on $[a,b]$. Its CF is given by the integral $\frac{1}{b-a}\int_a^b e^{iux}\,dx$, and therefore

$$X \sim \mathcal{U}([a,b]) : \quad \psi(u) = \frac{e^{iub} - e^{iua}}{iu(b-a)}.$$

In the frequent special case where $X$ is uniformly distributed on $[-T, +T]$,

$$X \sim \mathcal{U}([-T,+T]) : \quad \psi(u) = \frac{\sin(Tu)}{Tu}.$$
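The closed form for the uniform characteristic function can be compared with a direct numerical evaluation of the defining integral. The sketch below uses a midpoint rule; the helper names and the test values of $u$, $a$, $b$, $T$ are illustrative assumptions.

```python
import cmath
import math

def uniform_cf_numeric(u, a, b, steps=20_000):
    """Midpoint approximation of (1/(b-a)) * int_a^b e^{iux} dx."""
    h = (b - a) / steps
    total = sum(cmath.exp(1j * u * (a + (k + 0.5) * h)) for k in range(steps))
    return total * h / (b - a)

def uniform_cf_closed(u, a, b):
    """Closed form (e^{iub} - e^{iua}) / (iu(b-a)), with value 1 at u = 0."""
    if u == 0:
        return 1.0 + 0.0j
    return (cmath.exp(1j * u * b) - cmath.exp(1j * u * a)) / (1j * u * (b - a))

print(uniform_cf_numeric(1.3, -1.0, 2.0), uniform_cf_closed(1.3, -1.0, 2.0))

# Symmetric case [-T, T]: the CF is real and equals sin(Tu)/(Tu).
T, u = 2.0, 0.7
print(uniform_cf_closed(u, -T, T), math.sin(T * u) / (T * u))
```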


EXAMPLE 3.1.28: EXPONENTIAL, GAMMA AND CHI-SQUARE. One can check that the following table gives the characteristic function of the corresponding random variables:

(i) EXPONENTIAL

$$X \sim \mathcal{E}(\lambda) : \quad \psi(u) = \frac{\lambda}{\lambda - iu}.$$

Indeed, integrating by parts:

$$1 = \bigl[-e^{iux} e^{-\lambda x}\bigr]_0^\infty = -\int_0^\infty iu\, e^{iux} e^{-\lambda x}\,dx + \int_0^\infty e^{iux}\, \lambda e^{-\lambda x}\,dx = -\frac{iu}{\lambda}\,\psi(u) + \psi(u),$$

from which the result follows.

(ii) GAMMA. A standard computation gives

$$X \sim \gamma(\alpha, \beta) : \quad \psi(u) = \left(1 - \frac{iu}{\beta}\right)^{-\alpha}.$$

In particular, with $\beta = \lambda$ and $\alpha = 1$, we recover the result for the exponential distribution. Also, with $\beta = \frac{1}{2}$ and $\alpha = \frac{n}{2}$:

(iii) CHI-SQUARE with $n$ degrees of freedom

$$X \sim \chi^2_n : \quad \psi(u) = (1 - 2iu)^{-\frac{n}{2}}.$$

(This follows from (ii) since $\chi^2_n = \gamma(\frac{n}{2}, \frac{1}{2})$.)

EXAMPLE 3.1.29: CAUCHY. An elementary computation gives

$$\psi(u) = \int_{-\infty}^{+\infty} e^{iux}\, \frac{1}{\pi(1+x^2)}\,dx = e^{-|u|}.$$


Theorem 3.1.30 The characteristic function of a real random variable characterizes its distribution.

This means that if two random variables $X$ and $Y$ have the same characteristic function, then $P(X \le x) = P(Y \le x)$ for all $x \in \mathbb{R}$. The proof is omitted at this stage since a more general result is available in Section 5.2 (subsection Characteristic Functions, page 185).

Note that if $f$ is continuous, and if $\psi$ is integrable, a classical result of Fourier analysis tells us that

$$f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-iux}\, \psi(u)\,du.$$

This is a proof, in a particular case often encountered in practice, of the general result that the characteristic functions indeed characterize the distribution of a random variable.


Laplace Transforms 


For non-negative random variables or random vectors with non-negative coordi- 
nates, one can also work with Laplace transforms rather than with characteristic 
functions. 


Definition 3.1.31 The Laplace transform of a non-negative random variable $X$ (resp., of a cumulative probability distribution function $F$ on $\mathbb{R}_+$) is the function $t \in \mathbb{R}_+ \mapsto E[e^{-tX}]$ (resp. $t \in \mathbb{R}_+ \mapsto \int_{\mathbb{R}_+} e^{-tx}\,dF(x)$).

The Laplace transform characterizes the distribution of a non-negative random
variable, in the sense that 


Theorem 3.1.32 Two non-negative random variables X and Y with the same 
Laplace transforms have the same distribution. 


The proof will be based on the following lemma of intrinsic interest. 


Lemma 3.1.33 Two bounded random variables $X$ and $Y$ such that $E[X^n] = E[Y^n]$ for all $n = 0, 1, \ldots$ have the same distribution.

Proof. Let $M < \infty$ be a common bound of these variables. The hypothesis implies that for any polynomial $P$, $E[P(X)] = E[P(Y)]$. By the Weierstrass approximation theorem,¹ if $h : [0,M] \to \mathbb{R}$ is a continuous function, there exists for any $\varepsilon > 0$ a polynomial $P_\varepsilon$ such that $\sup_{0 \le x \le M} |h(x) - P_\varepsilon(x)| \le \varepsilon$. Therefore

$$E[|h(X) - P_\varepsilon(X)|] \le \varepsilon \quad \text{and} \quad E[|h(Y) - P_\varepsilon(Y)|] \le \varepsilon$$

and

$$|E[h(X)] - E[h(Y)]| \le E[|h(X) - P_\varepsilon(X)|] + |E[P_\varepsilon(X)] - E[P_\varepsilon(Y)]| + E[|h(Y) - P_\varepsilon(Y)|] = E[|h(X) - P_\varepsilon(X)|] + E[|h(Y) - P_\varepsilon(Y)|] \le 2\varepsilon.$$

Since $\varepsilon$ can be chosen arbitrarily small, $E[h(X)] = E[h(Y)]$. By uniformly approximating the indicator function of any interval $(a,b] \subset [0,M]$ by a continuous function, we deduce that

$$P(X \in (a,b]) = E[1_{(a,b]}(X)] = E[1_{(a,b]}(Y)] = P(Y \in (a,b]),$$

that is, $X \stackrel{\mathcal{D}}{=} Y$.

¹ A refinement of this fundamental result was given in Example 2.1.24.


We now prove Theorem 3.1.32. 


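Proof. The variables $U := e^{-X}$ and $V := e^{-Y}$ are bounded and such that for all $n = 0, 1, \ldots$, $E[U^n] = E[V^n]$ (these are the values at $t = n$ of the two Laplace transforms), so that $U \stackrel{\mathcal{D}}{=} V$, and therefore $X = -\log U$ and $Y = -\log V$ have the same distribution.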


Random Vectors 


We now consider random vectors, at first sight a notion not quite novel with respect 
to random variables. However it introduces dependency between two (or more) 
random variables. 


Definition 3.1.34 A random vector of dimension $n$ is a collection of $n$ real random variables

$$X = (X_1, \ldots, X_n).$$

From a probabilistic point of view, each of the random variables $X_1, \ldots, X_n$ can be characterized by its cumulative distribution function. However, the CDF of each coordinate of a random vector does not completely describe the probabilistic behavior of the whole vector. See Exercise 3.6.16.

Throughout this book we shall use compact notations for multiple integrals, for instance

$$\int_{\mathbb{R}^n} g(x)\,dx = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_n)\,dx_1 \cdots dx_n.$$

A. The absolutely continuous case. Let $X = (X_1, \ldots, X_n)$ be a random vector taking its values in $\mathbb{R}^n$ and let $f : \mathbb{R}^n \to \mathbb{R}_+$ be a function such that

$$\int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = 1. \tag{3.22}$$

Definition 3.1.35 The random vector $X$ taking its values in $\mathbb{R}^n$ is said to admit the probability density $f$ if

$$P(X \in C) = \int_C f(x)\,dx \qquad (C \in \mathcal{B}(\mathbb{R}^n)).$$

A random vector admitting a probability density is also called absolutely continuous.


B. Discrete random vectors. For convenience, we recall here a previously given definition with a slight change in the notation. Consider the random vector $X = (X_1, \ldots, X_n)$ where all the random variables $X_i$ take their values in the same (this restriction is not essential, but it simplifies the notation) denumerable space $E$. Let $f : E^n \to \mathbb{R}_+$ be a function such that

$$\sum_{x \in E^n} f(x) = 1.$$

Definition 3.1.36 The discrete random vector $X$ above is said to admit the probability distribution $f$ if for all sets $C \subseteq E^n$,

$$P(X \in C) = \sum_{x \in C} f(x).$$

In fact, as we already observed, there is nothing new here with respect to discrete random variables since $X$ is a discrete random variable taking its values in the denumerable set $\mathcal{X} := E^n$.


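C. The mixed case. Let $X = (X_1, \ldots, X_n)$ be a random vector of the form $X = (Z, Y)$ where $Z = (X_1, \ldots, X_k)$ ($k < n$) is a discrete random variable taking its values in $\mathcal{Z} = E^k$ for some integer $k \ge 1$, where $E$ is a denumerable set, and where $Y = (X_{k+1}, \ldots, X_n)$ is a random vector with values in $\mathcal{Y} = \mathbb{R}^{n-k}$. Let $f : \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}_+$ be a function such that

$$\sum_{z \in \mathcal{Z}} \int_{\mathbb{R}^{n-k}} f(z, y)\,dy = 1.$$

Definition 3.1.37 The random vector $X$ above is said to admit the mixed density $f$ if

$$P(Z \in A,\, Y \in B) = \sum_{z \in A} \int_B f(z, y)\,dy \qquad (A \subseteq E^k,\; B \in \mathcal{B}(\mathbb{R}^{n-k})).$$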


3.2 Continuous Random Vectors 


The results below have obvious counterparts concerning discrete random vectors 
and random vectors with a mixed density, sums replacing integrals when necessary. 
The corresponding statements and proofs are left to the reader. 

Theorem 3.2.1 Let $X = (X_1, X_2) \in \mathbb{R}^2$ be a two-dimensional vector with probability density function $f_{X_1,X_2}$. The probability density function of $X_1$ is obtained by integrating out $x_2$:

$$f_{X_1}(x_1) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_2)\,dx_2.$$

Proof. Indeed,

$$P(X_1 \le a) = P((X_1, X_2) \in (-\infty, a] \times \mathbb{R}) = \int_{-\infty}^{a} \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2 = \int_{-\infty}^{a} \left( \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_2)\,dx_2 \right) dx_1.$$

Theorem 3.2.1 extends in an obvious way to the case where $X_1$ and $X_2$ are random vectors.


EXAMPLE 3.2.2: THE BUTTERFLY DISTRIBUTION. Let $X = (X_1, X_2)$ be an absolutely continuous two-dimensional random vector with probability density function

$$f_{X_1,X_2}(x_1, x_2) = 2 \times 1_C(x_1, x_2)$$

where $C := ([0, \frac{1}{2}] \times [0, \frac{1}{2}]) \cup ([\frac{1}{2}, 1] \times [\frac{1}{2}, 1])$. (The support of this distribution has a "butterfly shape", hence the name.) Then $X_1 \sim \mathcal{U}([0,1])$. Indeed, if $x_1 \in [0, \frac{1}{2}]$,

$$f_{X_1}(x_1) = \int_0^{1/2} 2\,dx_2 = 1,$$

and if $x_1 \in [\frac{1}{2}, 1]$,

$$f_{X_1}(x_1) = \int_{1/2}^{1} 2\,dx_2 = 1.$$

Similarly, $X_2 \sim \mathcal{U}([0,1])$.
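The butterfly density also shows concretely that the marginals do not determine the joint law: both coordinates are uniform on $[0,1]$, yet the joint density is not the product of the marginals. A small numerical sketch (the function names and the midpoint-rule integration are assumptions of this illustration):

```python
def butterfly_pdf(x1, x2):
    """Joint density 2 * 1_C(x1, x2) of the butterfly distribution."""
    in_lower = 0.0 <= x1 <= 0.5 and 0.0 <= x2 <= 0.5
    in_upper = 0.5 <= x1 <= 1.0 and 0.5 <= x2 <= 1.0
    return 2.0 if (in_lower or in_upper) else 0.0

def marginal_x1(x1, steps=1000):
    """Midpoint approximation of the integral of the joint density over x2 in [0, 1]."""
    h = 1.0 / steps
    return sum(butterfly_pdf(x1, (k + 0.5) * h) for k in range(steps)) * h

print(marginal_x1(0.25), marginal_x1(0.75))  # both close to 1: X1 is uniform
# At (0.25, 0.75) the joint density vanishes while the product of the
# marginal densities is 1, so the joint density does not factor.
print(butterfly_pdf(0.25, 0.75))
```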

Definition 3.2.3 For a function $g : \mathbb{R}^n \to \mathbb{R}$, the expectation of $g(X)$ when $X$ admits a probability density $f$ is, by definition, the quantity

$$E[g(X)] := \int_{\mathbb{R}^n} g(x) f(x)\,dx, \tag{3.23}$$

where it is required that either $g$ be non-negative, or that

$$\int_{\mathbb{R}^n} |g(x)|\, f(x)\,dx < \infty. \tag{3.24}$$

In the case that (3.24) holds, one says that $g(X)$ is integrable. Expectation so defined enjoys, mutatis mutandis, the properties mentioned for the scalar case: linearity (see (3.8)), monotonicity (see (3.9)), and the triangle inequality (see (3.10)).

In the following theorem the hypothesis of absolute continuity is crucial.

Theorem 3.2.4 For any two independent absolutely continuous random variables $X$ and $Y$, $P(X = Y) = 0$.

More generally, if $(X_1, \ldots, X_n)$ is an absolutely continuous random vector, the probability of the event

$$C := \{\omega;\ X_i(\omega) = X_j(\omega) \text{ for some } i \ne j \text{ possibly depending on } \omega\}$$

is null.

Proof. We first do the proof for two random variables. Recall that for any event $A$, $E[1_A] = P(A)$. In particular,

$$P(X = Y) = E[1_{\{X = Y\}}] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(x, y)\,dx\,dy,$$

where $g(x, y) = 1_{\{x = y\}} f_X(x) f_Y(y)$ is null outside the diagonal. Since the diagonal has a null area, the integral is null.

For the general case, we first observe that

$$C = \bigcup_{i \ne j} \{X_i = X_j\},$$

and therefore, by sub-$\sigma$-additivity,

$$P(C) \le \sum_{\substack{i,j=1 \\ i \ne j}}^{n} P(X_i = X_j) = \sum_{\substack{i,j=1 \\ i \ne j}}^{n} 0 = 0.$$


Independence 


For random vectors, we shall use the following type of abbreviation: $P(X \le a)$ is short for $P(X_1 \le a_1, \ldots, X_n \le a_n)$, where $X = (X_1, \ldots, X_n)$ and $a = (a_1, \ldots, a_n)$.

Definition 3.2.5 Two random vectors $X$ and $Y$ of respective dimensions $n$ and $p$ are called independent vectors if for all $a \in \mathbb{R}^n$, $b \in \mathbb{R}^p$,

$$P(X \le a,\, Y \le b) = P(X \le a)\,P(Y \le b). \tag{3.25}$$

Definition 3.2.6 A sequence $\{X_n\}_{n \in \mathbb{N}}$ of real random vectors is called an independent sequence if for any finite collection of distinct random vectors $X_{i_1}, \ldots, X_{i_r}$ from this sequence,

$$P(\{X_{i_1} \le a_1\} \cap \{X_{i_2} \le a_2\} \cap \cdots \cap \{X_{i_r} \le a_r\}) = P(X_{i_1} \le a_1) \times P(X_{i_2} \le a_2) \times \cdots \times P(X_{i_r} \le a_r) \tag{3.26}$$

for all vectors (of appropriate dimensions) $a_1, \ldots, a_r$.

Definition 3.2.7 The sequences of random vectors $\{X_n\}_{n \in \mathbb{N}}$ and $\{Y_n\}_{n \in \mathbb{N}}$ are called independent sequences if

$$P\Bigl(\bigl(\cap_{\ell=1}^{r} \{X_{i_\ell} \le a_\ell\}\bigr) \cap \bigl(\cap_{m=1}^{s} \{Y_{j_m} \le b_m\}\bigr)\Bigr) = P\bigl(\cap_{\ell=1}^{r} \{X_{i_\ell} \le a_\ell\}\bigr)\, P\bigl(\cap_{m=1}^{s} \{Y_{j_m} \le b_m\}\bigr) \tag{3.27}$$

for all indices $i_1, \ldots, i_r$ and $j_1, \ldots, j_s \in \mathbb{N}$, and all vectors (of appropriate dimensions) $a_1, \ldots, a_r$ and $b_1, \ldots, b_s$.


Theorem 3.2.8 A. If $X_1, \ldots, X_n$ are absolutely continuous random vectors with probability density functions $f_1, \ldots, f_n$ respectively, and if, moreover, $X_1, \ldots, X_n$ are independent, then the probability density function of the vector $(X_1, \ldots, X_n)$ is the product of the probability density functions of its components:

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n). \tag{3.28}$$

B. Conversely, if the vector $X$ has a probability density function factoring as in (3.28), where $f_1, \ldots, f_n$ are probability density functions, then $X_1, \ldots, X_n$ are independent random vectors with respective probability density functions $f_1, \ldots, f_n$.

Proof. To simplify the writing we only consider the case $n = 2$ for random variables.

A. If $X_1, X_2$ are absolutely continuous random variables with probability density functions $f_1, f_2$ respectively, and if, moreover, $X_1, X_2$ are independent, then

$$P(X_1 \le x_1, X_2 \le x_2) = P(X_1 \le x_1)\,P(X_2 \le x_2) = \left( \int_{-\infty}^{x_1} f_1(y_1)\,dy_1 \right) \left( \int_{-\infty}^{x_2} f_2(y_2)\,dy_2 \right) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f_1(y_1) f_2(y_2)\,dy_1\,dy_2.$$

B. We have

$$P(X_1 \le x_1, X_2 \le x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f_1(y_1) f_2(y_2)\,dy_1\,dy_2,$$

that is, by Fubini's theorem,

$$P(X_1 \le x_1, X_2 \le x_2) = \left( \int_{-\infty}^{x_1} f_1(y_1)\,dy_1 \right) \times \left( \int_{-\infty}^{x_2} f_2(y_2)\,dy_2 \right).$$

Letting $x_2 = +\infty$ in the last identity yields

$$P(X_1 \le x_1) = \int_{-\infty}^{x_1} f_1(y_1)\,dy_1,$$

which proves that $X_1$ has the probability density function $f_1$. Similarly, $P(X_2 \le x_2) = \int_{-\infty}^{x_2} f_2(y_2)\,dy_2$, and therefore

$$P(X_1 \le x_1, X_2 \le x_2) = P(X_1 \le x_1)\,P(X_2 \le x_2),$$

which proves independence.


EXAMPLE 3.2.9: THE UNIFORM DISTRIBUTION ON THE SQUARE. Let $X = (X_1, X_2)$ be an absolutely continuous two-dimensional random vector with probability density function

$$f_{X_1,X_2}(x_1, x_2) = 1_{[0,1]^2}(x_1, x_2).$$

Since this probability density function factors as the product $1_{[0,1]}(x_1)\, 1_{[0,1]}(x_2)$ of two probability density functions of uniform distributions on $[0,1]$, $X_1 \sim \mathcal{U}([0,1])$, $X_2 \sim \mathcal{U}([0,1])$ and they are independent.

EXAMPLE 3.2.10: THE UNIFORM DISTRIBUTION ON THE DISK. Let $X = (X_1, X_2)$ be an absolutely continuous two-dimensional random vector uniformly distributed on the disk $D = \{(x_1, x_2);\ x_1^2 + x_2^2 \le 1\}$. Its probability density function is

$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{\pi}\, 1_D(x_1, x_2).$$

Clearly, this probability density function does not factor as the product of two probability density functions and therefore $X_1, X_2$ are not independent.

Here again, a discrete and a mixed version of Theorem 3.2.8 are available.

Product Formula for Expectations


The following result was already given in particular cases in Theorems 2.1.27 and 
2.3.6. 

Theorem 3.2.11 Let $Y$ and $Z$ be independent random vectors of dimension $p$ and $q$ respectively. If $g_1 : \mathbb{R}^p \to \mathbb{C}$ and $g_2 : \mathbb{R}^q \to \mathbb{C}$ are such that $g_1(Y)$ and $g_2(Z)$ are integrable, then the product $g_1(Y) g_2(Z)$ is integrable and

$$E[g_1(Y) g_2(Z)] = E[g_1(Y)]\,E[g_2(Z)]. \tag{3.29}$$

Formula (3.29) holds true without condition if $g_1$ and $g_2$ are real non-negative functions.

Proof. We consider the case where $Y$ and $Z$ admit probability densities $f_Y$ and $f_Z$. (The fully general result will be given in Theorem 5.4.4.) By Theorem 3.2.8, the vector $X = (Y, Z)$ admits the probability density

$$f_{Y,Z}(y, z) = f_Y(y) f_Z(z) \tag{3.30}$$

and the result follows by Fubini's theorem:

$$E[g_1(Y) g_2(Z)] = \int_{\mathbb{R}^p} \int_{\mathbb{R}^q} g_1(y) g_2(z) f_Y(y) f_Z(z)\,dy\,dz = \int_{\mathbb{R}^p} g_1(y) f_Y(y)\,dy \times \int_{\mathbb{R}^q} g_2(z) f_Z(z)\,dz.$$


Freeze and Integrate 


The next result is convenient for computing expectations of functions of two in- 
dependent vectors. It says that one may fix one of these vectors, compute the 
expectation with respect to the other vector, and take the expectation of the result 
with respect to the previously fixed vector. In other words: freeze and integrate. 


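Theorem 3.2.12 Let $X_1$, $X_2$ be independent random vectors of dimensions $n_1$ and $n_2$, and with probability density functions $f_1$ and $f_2$ respectively. Then for any function $g : \mathbb{R}^{n_1} \times \mathbb{R}^{n_2} \to \mathbb{R}$ that is either non-negative or such that $g(X_1, X_2)$ is integrable,

$$E[g(X_1, X_2)] = \int_{\mathbb{R}^{n_1}} E[g(y, X_2)]\, f_1(y)\,dy.$$

Proof. We have

$$E[g(X_1, X_2)] = \int_{\mathbb{R}^{n_1}} \int_{\mathbb{R}^{n_2}} g(x_1, x_2) f_1(x_1) f_2(x_2)\,dx_1\,dx_2 = \int_{\mathbb{R}^{n_1}} \left\{ \int_{\mathbb{R}^{n_2}} g(x_1, x_2) f_2(x_2)\,dx_2 \right\} f_1(x_1)\,dx_1 = \int_{\mathbb{R}^{n_1}} E[g(x_1, X_2)]\, f_1(x_1)\,dx_1.$$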
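
EXAMPLE 3.2.13: THE COMPUTATION OF $P(X_1 \le X_2)$. Let $X_1$, $X_2$ be as in Theorem 3.2.12, with $n_1 = n_2 = 1$, and let $F_2$ be the CDF of $X_2$. Then

$$P(X_1 \le X_2) = \int_{-\infty}^{+\infty} P(y \le X_2)\, f_1(y)\,dy = \int_{-\infty}^{+\infty} (1 - F_2(y))\, f_1(y)\,dy.$$

To prove this, it suffices to apply Theorem 3.2.12 with $g(x_1, x_2) = 1_{\{x_1 \le x_2\}}$ and to observe that $E[g(y, X_2)] = E[1_{\{y \le X_2\}}] = P(y \le X_2)$, which equals $1 - F_2(y)$ because $X_2$ is absolutely continuous.

If for instance $X_1 \sim \mathcal{E}(\lambda_1)$ and $X_2 \sim \mathcal{E}(\lambda_2)$, we obtain by application of the last displayed formula,

$$P(X_1 \le X_2) = \int_0^\infty e^{-\lambda_2 y}\, \lambda_1 e^{-\lambda_1 y}\,dy = \int_0^\infty \lambda_1 e^{-(\lambda_1 + \lambda_2) y}\,dy = \frac{\lambda_1}{\lambda_1 + \lambda_2}.$$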
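The value $\lambda_1/(\lambda_1 + \lambda_2)$ can be recovered by evaluating numerically the integral of $(1 - F_2(y)) f_1(y)$, where $1 - F_2(y) = P(X_2 > y)$. The following sketch does this for two exponential variables; the truncation point, the step count and the function name are choices made for this illustration.

```python
import math

def prob_x1_le_x2(lam1, lam2, upper=60.0, steps=200_000):
    """Midpoint approximation of int_0^upper (1 - F2(y)) f1(y) dy
    for X1 ~ E(lam1) and X2 ~ E(lam2), independent."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        survival_2 = math.exp(-lam2 * y)        # 1 - F2(y) = P(X2 > y)
        density_1 = lam1 * math.exp(-lam1 * y)  # f1(y)
        total += survival_2 * density_1
    return total * h

lam1, lam2 = 1.0, 3.0
print(prob_x1_le_x2(lam1, lam2), lam1 / (lam1 + lam2))  # both close to 0.25
```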


We now give the convolution formula for the probability density function of a 
random vector that is the sum of two independent absolutely continuous random 
vectors. 


Theorem 3.2.14 The probability density function of the random vector $Z = X + Y$, where $X$ and $Y$ are independent random vectors with the same dimension $n$ and with respective probability densities $f_X$ and $f_Y$, is given by the convolution formula

$$f_Z(z) = \int_{\mathbb{R}^n} f_Y(z - y) f_X(y)\,dy. \tag{3.31}$$

Proof. ($n = 1$ for simplicity) The probability density function of the vector $(X, Y)$ is $f_X(x) f_Y(y)$, and therefore, for all $a \in \mathbb{R}$,

$$P(Z \le a) = P(X + Y \le a) = E[1_{\{X + Y \le a\}}] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} 1_{\{x + y \le a\}} f_X(x) f_Y(y)\,dx\,dy.$$

The latter integral can be written, by Fubini's theorem,

$$\int_{-\infty}^{+\infty} \left\{ \int_{-\infty}^{+\infty} 1_{\{y \le a - x\}} f_Y(y)\,dy \right\} f_X(x)\,dx = \int_{-\infty}^{+\infty} \left\{ \int_{-\infty}^{a - x} f_Y(y)\,dy \right\} f_X(x)\,dx,$$

that is, after an obvious change of variable,

$$P(Z \le a) = \int_{-\infty}^{a} \left\{ \int_{-\infty}^{+\infty} f_Y(z - x) f_X(x)\,dx \right\} dz.$$


EXAMPLE 3.2.15: SUM OF INDEPENDENT UNIFORM RANDOM VARIABLES. Let $X \sim \mathcal{U}([0,1])$ and $Y \sim \mathcal{U}([0,1])$ be two independent random variables. Then $Z = X + Y$ admits the "triangular" probability density function

$$f_Z(z) = z\, 1_{[0,1]}(z) + (2 - z)\, 1_{(1,2]}(z).$$

This is an immediate application of (3.31) with $f_X(x) = 1_{[0,1]}(x)$ and $f_Y(y) = 1_{[0,1]}(y)$.
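The convolution integral (3.31) is easy to evaluate numerically for two uniform densities, which gives a direct check of the triangular shape. A minimal sketch, assuming a midpoint rule on $[0,1]$ (the support of $f_X$) and illustrative function names:

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolution_at(z, steps=10_000):
    """Midpoint approximation of int f_Y(z - y) f_X(y) dy; f_X vanishes outside [0, 1]."""
    h = 1.0 / steps
    return sum(uniform_pdf(z - (k + 0.5) * h) for k in range(steps)) * h

def triangular_pdf(z):
    """Claimed density of X + Y: z on [0, 1], 2 - z on (1, 2], 0 elsewhere."""
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

for z in (0.25, 0.8, 1.3, 1.9):
    print(z, convolution_at(z), triangular_pdf(z))
```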


Characteristic Functions and Laplace Transforms of Random Vectors 


The definition and properties of characteristic functions readily extend to the case 
of random vectors. 


Definition 3.2.16 The characteristic function of the real random vector $X = (X_1, \ldots, X_n)$ is the function $\psi : \mathbb{R}^n \to \mathbb{C}$ defined by

$$\psi(u) = E[e^{i u^T X}]. \tag{3.32}$$

In the case where $X$ is an absolutely continuous random vector with continuous probability density $f$,

$$\psi(u) = \int_{\mathbb{R}^n} e^{i u^T x} f(x)\,dx.$$

If moreover,

$$\int_{\mathbb{R}^n} |\psi(u)|\,du < \infty,$$

a theorem of analysis tells us that the probability density function of $X$ is then given by the Fourier inversion formula

$$f(x) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \psi(u)\, e^{-i u^T x}\,du.$$

(The proof is given in Corollary 5.3.3.)

Theorem 3.2.17 If two random vectors $X$ and $Y$ have the same characteristic function, they have the same distribution.

The proof is postponed until Corollary 5.3.2.

Definition 3.2.18 The Laplace transform of a vector of non-negative random variables $(X_1, \ldots, X_m)$ is the function $(t_1, \ldots, t_m) \in \mathbb{R}_+^m \mapsto E\bigl[e^{-\sum_{i=1}^{m} t_i X_i}\bigr] \in [0,1]$.

The following result, which will be admitted,² generalizes Theorem 3.1.32.

Theorem 3.2.19 Two non-negative random vectors $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_m)$ with the same Laplace transforms have the same distribution.


Characteristic Function Test for Independence 


Characteristic functions give one of the most useful criteria for testing indepen- 
dence of random vectors. 


Theorem 3.2.20 Suppose that $Y$ and $Z$ are two random vectors of respective dimensions $p$ and $q$, and that for all $v \in \mathbb{R}^p$, $w \in \mathbb{R}^q$, it holds that

$$E[e^{i(v^T Y + w^T Z)}] = \psi_1(v)\,\psi_2(w), \tag{3.33}$$

where $\psi_1(v)$ and $\psi_2(w)$ are the characteristic functions of some random vectors $\tilde{Y}$ and $\tilde{Z}$ of respective dimensions $p$ and $q$. Then $Y$ and $Z$ are independent, $Y$ has the same distribution as $\tilde{Y}$, and $Z$ has the same distribution as $\tilde{Z}$.

Proof. Define $X = (Y, Z)$ and $u = (v, w)$, so that (3.33) reads

$$E[e^{i u^T X}] = \psi(u) = \psi_1(v)\,\psi_2(w).$$

Consider two independent random vectors $\hat{Y}$ and $\hat{Z}$, where $\hat{Y}$ is distributed as $\tilde{Y}$, and $\hat{Z}$ is distributed as $\tilde{Z}$. Let $\hat{X} = (\hat{Y}, \hat{Z})$. Then, by the product formula for expectations,

$$E[e^{i u^T \hat{X}}] = E[e^{i v^T \hat{Y}} e^{i w^T \hat{Z}}] = E[e^{i v^T \hat{Y}}]\,E[e^{i w^T \hat{Z}}] = E[e^{i v^T \tilde{Y}}]\,E[e^{i w^T \tilde{Z}}] = \psi_1(v)\,\psi_2(w).$$

Therefore, $(Y, Z)$ has the same distribution as $(\hat{Y}, \hat{Z})$ and in particular, $Y$ and $Z$ are independent.

² See Chapter 6 of [11].

EXAMPLE 3.2.21: CONVOLUTION FORMULA VIA FOURIER. We give an alternative proof of Theorem 3.2.14. Define

$$(f_Y * f_Z)(x) := \int_{\mathbb{R}^n} f_Y(x - z) f_Z(z)\,dz.$$

First observe that $f := f_Y * f_Z$ is a probability density function, that is, a non-negative function integrating to 1:

$$\int_{\mathbb{R}^n} f(x)\,dx = \int_{\mathbb{R}^n} \left( \int_{\mathbb{R}^n} f_Y(x - z) f_Z(z)\,dz \right) dx = \int_{\mathbb{R}^n} \left( \int_{\mathbb{R}^n} f_Y(x - z)\,dx \right) f_Z(z)\,dz = \int_{\mathbb{R}^n} 1 \times f_Z(z)\,dz = 1.$$

By a classical result of Fourier analysis, the Fourier transform of $f_Y * f_Z$ is the product of the Fourier transforms of $f_Y$ and $f_Z$, that is

$$\int_{\mathbb{R}^n} e^{i u^T x} f(x)\,dx = E[e^{i u^T Y}]\,E[e^{i u^T Z}] = E[e^{i u^T (Y + Z)}],$$

where we have used the independence of $Y$ and $Z$ for the second equality. Thus $E[e^{i u^T (Y + Z)}]$ is the characteristic function of $Y + Z$ and of a random vector with probability density function $f(x)$. Therefore $f$ is the probability density function of $Y + Z$.


Random Sums and Wald’s Identity 


The next two results concerning random sums were already proved in the discrete 
case (Theorems 2.3.12 and 2.5.18). 


Theorem 3.2.22 Let $\{Y_n\}_{n \ge 1}$ be an IID sequence of random variables with the common characteristic function $\varphi_Y$. Let $T$ be a random variable, integer-valued, independent of the sequence $\{Y_n\}_{n \ge 1}$, and let $g_T$ be its generating function. The characteristic function of the random variable

$$X = \sum_{n=1}^{T} Y_n,$$

where by convention $\sum_{n=1}^{0} = 0$, is

$$\varphi_X(u) = g_T(\varphi_Y(u)). \tag{3.34}$$

Proof. We need only to adapt the proof of Theorem 2.3.12. We have

$$e^{iuX} = e^{iu \sum_{n=1}^{T} Y_n} = \sum_{k=0}^{\infty} e^{iu \sum_{n=1}^{T} Y_n}\, 1_{\{T = k\}} = \sum_{k=0}^{\infty} e^{iu \sum_{n=1}^{k} Y_n}\, 1_{\{T = k\}}.$$

Therefore,

$$E[e^{iuX}] = \sum_{k=0}^{\infty} E\bigl[1_{\{T = k\}}\, e^{iu \sum_{n=1}^{k} Y_n}\bigr] = \sum_{k=0}^{\infty} E[1_{\{T = k\}}]\, E\bigl[e^{iu \sum_{n=1}^{k} Y_n}\bigr],$$

where we have used independence of $T$ and $\{Y_n\}_{n \ge 1}$. Now, $E[1_{\{T = k\}}] = P(T = k)$, and $E[e^{iu \sum_{n=1}^{k} Y_n}] = \varphi_Y(u)^k$, and therefore

$$E[e^{iuX}] = \sum_{k=0}^{\infty} P(T = k)\, \varphi_Y(u)^k = g_T(\varphi_Y(u)).$$


Theorem 3.2.23 Let $\{Y_n\}_{n \ge 1}$ be a sequence of integrable random variables such that $E[Y_n] = E[Y_1]$ for all $n \ge 1$. Let $T$ be an integer-valued random variable such that for all $n \ge 1$, the event $\{T \ge n\}$ is independent of $Y_n$. Let

$$X = \sum_{n=1}^{T} Y_n.$$

Then

$$E[X] = E[Y_1]\,E[T]. \tag{3.35}$$


Proof. Same as that of Theorem 2.5.18. 
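Wald's identity lends itself to a quick simulation check. The sketch below makes specific choices that are not in the text: the $Y_n$ are IID exponential with parameter `lam`, and $T$ is uniform on $\{0, \ldots, t_{\max}\}$ and drawn independently of the $Y_n$ (a special case of the hypothesis of the theorem). The sample size and the seed are arbitrary.

```python
import random

def random_sum(rng, lam, t_max):
    """One sample of X = Y_1 + ... + Y_T (empty sum = 0)."""
    t = rng.randint(0, t_max)  # T uniform on {0, ..., t_max}, independent of the Y_n
    return sum(rng.expovariate(lam) for _ in range(t))

def estimate_mean(n_samples=100_000, lam=2.0, t_max=4, seed=0):
    """Monte Carlo estimate of E[X]."""
    rng = random.Random(seed)
    return sum(random_sum(rng, lam, t_max) for _ in range(n_samples)) / n_samples

lam, t_max = 2.0, 4
expected = (1.0 / lam) * (t_max / 2.0)  # E[Y_1] * E[T]
print(estimate_mean(lam=lam, t_max=t_max), expected)
```

The agreement is only statistical: the estimate fluctuates around $E[Y_1]E[T]$ with an error of order $1/\sqrt{\text{n\_samples}}$.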


Smooth Change of Variables 


Let $X = (X_1, \ldots, X_n)$ be a random vector with the probability density function $f_X$, and define the random vector $Y = g(X)$, where $g : \mathbb{R}^n \to \mathbb{R}^n$. More explicitly,

$$Y_1 = g_1(X_1, \ldots, X_n), \quad \ldots, \quad Y_n = g_n(X_1, \ldots, X_n).$$

Under smoothness assumptions on $g$, the random vector $Y$ is absolutely continuous, and its probability density function can be explicitly computed from $g$ and the probability density function $f_X$. These assumptions are the following:

$A_1$: The function $g : U \to \mathbb{R}^n$, where $U$ is an open subset of $\mathbb{R}^n$, is one-to-one (injective).

$A_2$: The coordinate functions $g_i$ ($1 \le i \le n$) are continuously differentiable.

$A_3$: Moreover, denoting the Jacobian matrix of the function $g$ by

$$J_g(x_1, \ldots, x_n) := \left\{ \frac{\partial g_i}{\partial x_j}(x_1, \ldots, x_n) \right\}_{1 \le i,j \le n},$$

we assume that on $U$,

$$|\det J_g(x_1, \ldots, x_n)| > 0.$$

A standard result of Analysis says that $V = g(U)$ is an open subset of $\mathbb{R}^n$, and that the invertible function $g : U \to V$ admits an inverse $g^{-1} : V \to U$ with the same properties as the direct function $g$. In particular, on $V$,

$$|\det J_{g^{-1}}(y_1, \ldots, y_n)| > 0.$$

Moreover,

$$J_{g^{-1}}(y) = J_g(g^{-1}(y))^{-1}.$$

Also, under conditions $A_1$ to $A_3$, we have the basic rule of change of variables of calculus: For any function $u : \mathbb{R}^n \to \mathbb{R}$,

$$\int_U u(x)\,dx = \int_{g(U)} u(g^{-1}(y))\, |\det J_{g^{-1}}(y)|\,dy.$$


Theorem 3.2.24 Under the conditions just stated for $X$, $g$, and $U$, and if moreover $P(X \in U) = 1$, then $Y$ admits the probability density

$$f_Y(y) = f_X(g^{-1}(y))\, |\det J_g(g^{-1}(y))|^{-1}\, 1_V(y). \tag{3.36}$$

Proof. The proof consists in checking that for any bounded function $h : \mathbb{R}^n \to \mathbb{R}$,

$$E[h(Y)] = \int_{\mathbb{R}^n} h(y)\, \psi(y)\,dy, \tag{3.37}$$

where $\psi$ is the function on the right-hand side of (3.36). Indeed, taking $h(y) = 1_{\{y \le a\}} = 1_{\{y_1 \le a_1\}} \cdots 1_{\{y_n \le a_n\}}$, (3.37) reads

$$P(Y_1 \le a_1, \ldots, Y_n \le a_n) = \int_{-\infty}^{a_1} \cdots \int_{-\infty}^{a_n} \psi(y_1, \ldots, y_n)\,dy_1 \cdots dy_n.$$

To prove that (3.37) holds with the appropriate $\psi$, one just uses the basic rule of change of variables:

$$E[h(Y)] = E[h(g(X))] = \int_U h(g(x))\, f_X(x)\,dx = \int_V h(y)\, f_X(g^{-1}(y))\, |\det J_{g^{-1}}(y)|\,dy.$$

Corollary 3.2.25 Let $X$ be an $n$-dimensional random vector with probability density $f_X$. Let $A$ be an invertible $n \times n$ real matrix and $b$ an $n$-dimensional real vector. Then, the random vector $Y = AX + b$ admits the density

$$f_Y(y) = f_X(A^{-1}(y - b))\, \frac{1}{|\det A|}. \tag{3.38}$$

Proof. Here $U = \mathbb{R}^n$, $g(x) = Ax + b$, and $|\det J_{g^{-1}}(y)| = \frac{1}{|\det A|}$.


EXAMPLE 3.2.26: POLAR COORDINATES. Let $(X_1, X_2)$ be a 2-dimensional random vector with probability density $f_{X_1,X_2}(x_1, x_2)$, and let $(R, \Theta)$ be its polar coordinates. The probability density of $(R, \Theta)$ is given by the formula

$$f_{R,\Theta}(r, \theta) = f_{X_1,X_2}(r\cos\theta,\, r\sin\theta)\, r.$$

Proof. Here $g$ is the bijective function from the open set $U$ consisting of $\mathbb{R}^2$ without the half-line $\{(x_1, 0);\ x_1 \ge 0\}$ to the open set $V = (0, \infty) \times (0, 2\pi)$. The inverse function is

$$x_1 = r\cos\theta, \qquad x_2 = r\sin\theta,$$

with Jacobian

$$J_{g^{-1}}(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$

of determinant $\det J_{g^{-1}}(r, \theta) = r$. Applying formula (3.36), we obtain the announced result.
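As a sanity check of the polar formula, one can take the uniform density on the unit disk of Example 3.2.10 and verify that the resulting density of $(R, \Theta)$, namely $r/\pi$ on $(0,1) \times (0, 2\pi)$, integrates to 1. The grid sizes and function names below are assumptions of this sketch.

```python
import math

def disk_pdf(x1, x2):
    """Uniform density on the unit disk (Example 3.2.10)."""
    return 1.0 / math.pi if x1 * x1 + x2 * x2 <= 1.0 else 0.0

def polar_pdf(r, theta):
    """Density of (R, Theta): f_X(r cos(theta), r sin(theta)) * r."""
    return disk_pdf(r * math.cos(theta), r * math.sin(theta)) * r

def integrate_polar(r_max=1.5, n_r=600, n_theta=200):
    """Midpoint rule for the integral of polar_pdf over (0, r_max) x (0, 2 pi)."""
    hr = r_max / n_r
    ht = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * hr
        for j in range(n_theta):
            total += polar_pdf(r, (j + 0.5) * ht)
    return total * hr * ht

print(integrate_polar())  # close to 1
```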


There are cases of practical interest where the function $g$ does not admit an inverse but where its domain $U$ can be decomposed into disjoint open sets (say, 2): $U = U_1 + U_2$, such that the restrictions of $g$ to $U_1$ and $U_2$, respectively $g_1$ and $g_2$, satisfy the conditions of smoothness and of injectivity of the standard case. In this case the same method applies, but one must dissociate the integrals:

$$\int_U h(g(x))\, f_X(x)\,dx = \int_{U_1} h(g_1(x))\, f_X(x)\,dx + \int_{U_2} h(g_2(x))\, f_X(x)\,dx$$

and apply the formula of smooth change of variables to each part separately. This gives

$$E[h(Y)] = \int_{g_1(U_1)} h(y)\, f_X(g_1^{-1}(y))\, |\det J_{g_1^{-1}}(y)|\,dy + \int_{g_2(U_2)} h(y)\, f_X(g_2^{-1}(y))\, |\det J_{g_2^{-1}}(y)|\,dy$$

and therefore

$$f_Y(y) = f_X(g_1^{-1}(y))\, |\det J_{g_1^{-1}}(y)|\, 1_{g_1(U_1)}(y) + f_X(g_2^{-1}(y))\, |\det J_{g_2^{-1}}(y)|\, 1_{g_2(U_2)}(y).$$


Order Statistics 


We now give a formula which allows us to compute the probability density function 
of the random vector obtained by reordering the coordinates of a given absolutely 
continuous random vector. 


Let $X_1, \ldots, X_n$ be independent random variables with the same probability density function $f$. We know (see Theorem 3.2.4) that the probability of two or more among $X_1, \ldots, X_n$ taking the same value is null. Therefore one can define unambiguously the random variables $Z_1, \ldots, Z_n$ obtained by arranging $X_1, \ldots, X_n$ in increasing order:

$$Z_1 < Z_2 < \cdots < Z_n.$$

In particular, $Z_1 = \min(X_1, \ldots, X_n)$ and $Z_n = \max(X_1, \ldots, X_n)$.

Theorem 3.2.27 The probability density of the reordered vector $Z = (Z_1, \ldots, Z_n)$ (defined above) is

$$f_Z(z_1, \ldots, z_n) = n!\, \Bigl\{ \prod_{j=1}^{n} f(z_j) \Bigr\}\, 1_C(z_1, \ldots, z_n), \tag{3.39}$$

where $C := \{(x_1, \ldots, x_n) \in \mathbb{R}^n;\ x_1 < x_2 < \cdots < x_n\}$.

Proof. Let $\sigma$ be the permutation of $\{1, \ldots, n\}$ that orders $X_1, \ldots, X_n$ in ascending order, that is,

$$X_{\sigma(i)} = Z_i$$

(note that $\sigma$ is a random permutation). Writing $X_\sigma := (X_{\sigma(1)}, \ldots, X_{\sigma(n)})$, for any set $A \subseteq \mathbb{R}^n$,

$$P(Z \in A) = P(Z \in A \cap C) = P(X_\sigma \in A \cap C) = \sum_{\sigma_0} P(X_{\sigma_0} \in A \cap C,\ \sigma = \sigma_0),$$

where the sum is over all permutations $\sigma_0$ of $\{1, \ldots, n\}$. Observing that $X_{\sigma_0} \in A \cap C$ implies $\sigma = \sigma_0$,

$$P(X_{\sigma_0} \in A \cap C,\ \sigma = \sigma_0) = P(X_{\sigma_0} \in A \cap C)$$

and therefore, since the probability distribution of $X_{\sigma_0}$ does not depend upon a fixed permutation $\sigma_0$ (here we invoke the independence and equidistribution assumption for the $X_i$'s),

$$P(X_{\sigma_0} \in A \cap C) = P(X \in A \cap C).$$

Therefore,

$$P(Z \in A) = \sum_{\sigma_0} P(X \in A \cap C) = n!\, P(X \in A \cap C) = n! \int_{A \cap C} f_X(x)\,dx = \int_A n!\, f_X(x)\, 1_C(x)\,dx.$$


EXAMPLE 3.2.28: VOLUME OF A RIGHT-ANGLED PYRAMID. We shall apply the above result to prove the formula

\[ \int_a^b \cdots \int_a^b 1_C(z_1, \ldots, z_n)\,dz_1 \cdots dz_n = \frac{(b-a)^n}{n!}. \qquad (3.40) \]

Indeed, when the $X_i$'s are uniformly distributed over $[a, b]$,

\[ f_Z(z_1, \ldots, z_n) = \frac{n!}{(b-a)^n} 1_{[a,b]^n}(z_1, \ldots, z_n) 1_C(z_1, \ldots, z_n). \qquad (3.41) \]

The result follows since $\int_{\mathbb{R}^n} f_Z(z)\,dz = 1$.


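EXAMPLE 3.2.29: THE i-TH SMALLEST UNIFORM. We seek the probability density function of the random variable $Z_i$, the $i$th smallest among $X_1, \ldots, X_n$, when the $X_i$'s are independent random variables uniformly distributed on $[0, 1]$. By Theorem 3.2.27, the probability density of $Z = (Z_1, \ldots, Z_n)$ is

\[ f_Z(z_1, \ldots, z_n) = n! \, 1_{[0,1]^n}(z_1, \ldots, z_n) 1_C(z_1, \ldots, z_n), \]

where $C = \{z_1 < z_2 < \cdots < z_n\}$. The density of $Z_i$ is obtained by integrating $f_Z$ with respect to $z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n$:

\[ f_{Z_i}(z) = n! \underbrace{\int_0^1 \cdots \int_0^1}_{n-1} 1_{\{z_1 < \cdots < z_{i-1} < z < z_{i+1} < \cdots < z_n\}}\,dz_1 \cdots dz_{i-1}\,dz_{i+1} \cdots dz_n \]

\[ = n! \Big( \underbrace{\int_0^1 \cdots \int_0^1}_{i-1} 1_{\{z_1 < \cdots < z_{i-1} < z\}}\,dz_1 \cdots dz_{i-1} \Big) \times \Big( \underbrace{\int_0^1 \cdots \int_0^1}_{n-i} 1_{\{z < z_{i+1} < \cdots < z_n\}}\,dz_{i+1} \cdots dz_n \Big), \]

that is, in view of the result of Example 3.2.28 (applied on $[0, z]$ with $i-1$ variables and on $[z, 1]$ with $n-i$ variables),

\[ f_{Z_i}(z) = n! \, \frac{z^{i-1}}{(i-1)!} \, \frac{(1-z)^{n-i}}{(n-i)!} \, 1_{[0,1]}(z). \]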
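As a numerical sanity check of this example, the following sketch evaluates the density of the $i$th smallest of $n$ uniform variables, taken here to be $n!\,z^{i-1}(1-z)^{n-i}/((i-1)!\,(n-i)!)$ on $[0,1]$, and integrates it with a midpoint rule. The function names and the choice $i = 2$, $n = 5$ are illustrative assumptions, not part of the text.

```python
from math import factorial

def order_stat_density(z, i, n):
    """Density at z of the i-th smallest of n IID uniform(0, 1) variables."""
    if z < 0.0 or z > 1.0:
        return 0.0
    coeff = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return coeff * z ** (i - 1) * (1.0 - z) ** (n - i)

def integrate_unit_interval(func, steps=20000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / steps
    return h * sum(func((k + 0.5) * h) for k in range(steps))

# Sanity checks for i = 2, n = 5: total mass and mean of the density.
total_mass = integrate_unit_interval(lambda z: order_stat_density(z, 2, 5))
mean_value = integrate_unit_interval(lambda z: z * order_stat_density(z, 2, 5))
```

The total mass should be close to 1 and the mean close to $i/(n+1) = 1/3$, up to the discretization error of the midpoint rule.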


Sampling a Distribution 


We now address a problem that arises in the context of simulation of stochastic systems. It consists in generating a random variable with prescribed CDF, or in other terms, sampling the said CDF. For this, one is allowed to use a random generator that produces a sequence $U_1, U_2, \ldots$ of independent real random variables, uniformly distributed on $[0, 1]$. In practice, the numbers that such random generators produce are not quite random, but they look as if they are (they are called pseudo-random generators). The topic of how to devise a good pseudo-random generator is out of our scope, and we shall admit that we can trust our favorite computer to provide us with an IID sequence of random variables uniformly distributed on $[0, 1]$ (from now on we call them random numbers).


We now give two methods for constructing a random variable $Z$ with CDF $F(z) = P(Z \le z)$.

In the case where $Z$ is a discrete random variable with distribution $P(Z = a_i) = p_i$ ($0 \le i \le K$), the basic principle of the sampling algorithm is the following:




(a) Draw $U \sim \mathcal{U}([0, 1])$.
(b) Set $Z = a_\ell$ if $p_0 + p_1 + \cdots + p_{\ell-1} < U \le p_0 + p_1 + \cdots + p_\ell$.


This method is called the method of the inverse. 


A crude generation algorithm would successively perform the tests $U \le p_0$?, $U \le p_0 + p_1$?, ..., until the answer is positive. The average number of iterations required would therefore be $\sum_{i \ge 0} (i+1) p_i = 1 + E[Z]$ (the last equality holding when $a_i = i$). This number may be too large, but there are ways of improving this, as Example 7.2.2 will show for the Poisson distribution.
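The crude version of the method of the inverse can be sketched as follows. This is a minimal illustration, assuming the values $a_i$ and probabilities $p_i$ are given as two lists; the function name `sample_discrete` and the example distribution are hypothetical, and the random number $u$ is passed in explicitly so that the behaviour is reproducible.

```python
def sample_discrete(u, values, probs):
    """Return values[l] for the first l with u <= probs[0] + ... + probs[l]."""
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u <= cumulative:
            return value
    return values[-1]  # guards against rounding when u is very close to 1

# Illustrative distribution on three values.
values = ["a0", "a1", "a2"]
probs = [0.2, 0.5, 0.3]
```

In a simulation one would call `sample_discrete(random.random(), values, probs)`; the loop performs the successive tests described above, one per iteration.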


The above method can be generalized to real random variables. Since for $u \in (0, 1)$, the set $\{x ; F(x) \ge u\}$ is an unbounded interval of $\mathbb{R}$ (closed on the left, by right-continuity of $F$), it admits a smallest element denoted by $F^{\leftarrow}(u)$:

\[ \{x ; F(x) \ge u\} = [F^{\leftarrow}(u), +\infty). \]

The function $F^{\leftarrow}$ so defined on $(0, 1)$ is non-decreasing. It is called the pseudo-inverse of $F$ and coincides with the inverse function $F^{-1}$ when $F$ is continuous and strictly increasing.

Theorem 3.2.30 If $U$ is a uniform random variable on $(0, 1)$, then $F^{\leftarrow}(U)$ has the same probability distribution as $X$, where $X$ is any random variable with CDF $F$.


Proof. First note that for all $u \in (0, 1)$, $F^{\leftarrow}(u) \le t$ implies $F(t) \ge u$. Indeed, in this case, for all $s > t$ there exists an $x < s$ such that $F(x) \ge u$ and therefore $F(s) \ge u$; and consequently, by right-continuity of $F$, $F(t) \ge u$. Conversely, $F(t) \ge u$ implies that $t \in \{x ; F(x) \ge u\}$ and therefore $F^{\leftarrow}(u) \le t$. Taking all this into account,

\[ F^{\leftarrow}(u) \le t \iff u \le F(t). \]

This forces $P(F^{\leftarrow}(U) \le t) = P(U \le F(t))$ to equal $F(t)$.


EXAMPLE 3.2.31: EXPONENTIAL DISTRIBUTION. We want to sample from $\mathcal{E}(\lambda)$. The corresponding CDF is

\[ F(z) = 1 - e^{-\lambda z} \quad (z \ge 0). \]

The solution of $y = 1 - e^{-\lambda z}$ is $z = -\frac{1}{\lambda} \ln(1 - y) = F^{-1}(y)$, and therefore $Z = -\frac{1}{\lambda} \ln(1 - U)$ will do, or since $U$ and $1 - U$ have the same distribution,

\[ Z = -\frac{1}{\lambda} \ln U. \]

Here is an often useful trick.


EXAMPLE 3.2.32: SYMMETRIC EXPONENTIAL DISTRIBUTION. We want to sample from the symmetric exponential distribution with probability density function

\[ f(x) = \frac{\lambda}{2} e^{-\lambda |x|}. \]

One way is to generate two independent random variables $Y$ and $Z$ where $Z \sim \mathcal{E}(\lambda)$ and $P(Y = +1) = P(Y = -1) = \frac{1}{2}$. Taking $X = YZ$ we have that

\[ P(X \le x) = P(Y = +1, Z \le x) + P(Y = -1, Z \ge -x) = \frac{1}{2} \big( F_Z(x) + 1 - F_Z(-x) \big), \]

and therefore, taking derivatives,

\[ f_X(x) = \frac{1}{2} \big( f_Z(x) + f_Z(-x) \big) = \frac{1}{2} f_Z(|x|). \]
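The two preceding examples translate into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the uniform numbers are passed as arguments (they are assumed to lie in $(0, 1]$ so that the logarithm is defined), and the exponential sample uses $-\frac{1}{\lambda} \ln U$.

```python
import math

def sample_exponential(u, lam):
    """Inverse method: map u in (0, 1] to an exponential sample of rate lam."""
    return -math.log(u) / lam

def sample_symmetric_exponential(u_sign, u_size, lam):
    """Random sign Y times an independent exponential magnitude Z."""
    sign = 1.0 if u_sign < 0.5 else -1.0
    return sign * sample_exponential(u_size, lam)
```

Two independent random numbers are needed for the symmetric case, one for the sign $Y$ and one for the magnitude $Z$.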


The computation of the inverse of the cumulative distribution function of the 
random variable to be generated may be difficult. An alternative method is the 
method of acceptance-rejection below. 


Let $\{Y_n\}_{n \ge 1}$ be a sequence of IID random variables with probability density $g$ that satisfies the two requirements below:

(i) it is easy (or at least feasible) to sample it, and

(ii) for all $x \in \mathbb{R}$,

\[ \frac{f(x)}{g(x)} \le c \qquad (3.42) \]

for some finite constant $c$ (necessarily larger than or equal to 1).

Let $\{U_n\}_{n \ge 1}$ be a sequence of IID random variables uniformly distributed on $[0, 1]$.

Theorem 3.2.33 Let $\tau$ be the first index $n \ge 1$ for which

\[ U_n \le \frac{f(Y_n)}{c \, g(Y_n)} \]

and let $Z := Y_\tau$. Then
(a) $Z$ admits the probability density function $f$, and
(b) $E[\tau] = c$.
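Proof. We have

\[ P(Z \le x) = P(Y_\tau \le x) = \sum_{n \ge 1} P(\tau = n, Y_n \le x). \]

Denote by $A_k$ the event $\{U_k > \frac{f(Y_k)}{c \, g(Y_k)}\}$ and by $\bar{A}_k$ its complement. Then, by independence,

\[ P(\tau = n, Y_n \le x) = P(A_1, \ldots, A_{n-1}, \bar{A}_n, Y_n \le x) = P(A_1) \cdots P(A_{n-1}) P(\bar{A}_n, Y_n \le x), \]

where

\[ P(\bar{A}_k, Y_k \le x) = \int_{-\infty}^{x} P\Big( U_k \le \frac{f(y)}{c \, g(y)} \Big) g(y)\,dy = \int_{-\infty}^{x} \frac{f(y)}{c \, g(y)} g(y)\,dy = \frac{1}{c} \int_{-\infty}^{x} f(y)\,dy. \]

Letting $x \uparrow \infty$ gives $P(\bar{A}_k) = \frac{1}{c}$, and therefore $P(A_k) = 1 - \frac{1}{c}$. Therefore

\[ P(Z \le x) = \sum_{n \ge 1} \Big( 1 - \frac{1}{c} \Big)^{n-1} \frac{1}{c} \int_{-\infty}^{x} f(y)\,dy = \int_{-\infty}^{x} f(y)\,dy. \]

Also, using the above calculations,

\[ P(\tau = n) = P(A_1, \ldots, A_{n-1}, \bar{A}_n) = P(A_1) \cdots P(A_{n-1}) P(\bar{A}_n) = \Big( 1 - \frac{1}{c} \Big)^{n-1} \frac{1}{c}, \]

from which it follows that $E[\tau] = c$.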



We see that the method depends on our ability to easily generate random vectors with the probability density $g$. Also we have to select a probability density function satisfying the constraint (3.42), with $c$ as small as possible.
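The acceptance-rejection method can be sketched as follows. The target density $f(x) = 6x(1-x)$ on $[0, 1]$, the uniform proposal $g$, the constant $c = 1.5$, the seed and all function names are assumptions chosen for illustration; they do not come from the text.

```python
import random

def target_density(x):
    """Illustrative target f(x) = 6x(1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

def proposal_density(x):
    """Uniform proposal g on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

C = 1.5  # supremum of f/g, attained at x = 1/2

def accept_reject(rng):
    """Return (Z, tau): the accepted value and the number of trials used."""
    tau = 0
    while True:
        tau += 1
        y = rng.random()   # Y_n drawn from the proposal g
        u = rng.random()   # U_n uniform on [0, 1)
        if u * C * proposal_density(y) <= target_density(y):
            return y, tau

rng = random.Random(0)
draws = [accept_reject(rng) for _ in range(20000)]
mean_z = sum(z for z, _ in draws) / len(draws)
mean_tau = sum(t for _, t in draws) / len(draws)
```

With these choices the empirical mean of $Z$ should be near $1/2$ (the mean of $f$) and the empirical mean of $\tau$ near $c = 1.5$, in line with the statement $E[\tau] = c$.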
3.3 Square-integrable Random Variables


Definition 3.3.1 A complex random variable $X$ is said to be square-integrable if $E[|X|^2] < \infty$.


Theorem 3.3.2 The set of complex square-integrable random variables, denoted $L^2_{\mathbb{C}}(P)$, is a vector space with scalar field $\mathbb{C}$. Similarly, the set of real square-integrable random variables, denoted $L^2_{\mathbb{R}}(P)$, is a vector space with scalar field $\mathbb{R}$.

Proof. We show that if $X$ and $Y$ are complex square-integrable random variables and $\lambda \in \mathbb{C}$, then $\lambda X$ and $X + Y$ are square-integrable. The first assertion is obvious. For the last assertion use (for instance) the inequality $(a + b)^2 \le 2a^2 + 2b^2$, true for all $a, b \in \mathbb{R}$, to obtain

\[ E[|X + Y|^2] \le E[(|X| + |Y|)^2] \le 2E[|X|^2] + 2E[|Y|^2] < \infty. \]


Lemma 3.3.3 A non-negative random variable $Z$ such that $E[Z] = 0$ is almost surely equal to 0.

Proof. By Markov's inequality, $P(Z \ge \frac{1}{n}) \le nE[Z] = 0$, and therefore, by the sequential continuity of probability,

\[ P(Z > 0) = P\Big( \bigcup_{n \ge 1} \Big\{ Z \ge \frac{1}{n} \Big\} \Big) = \lim_{n \uparrow \infty} P\Big( Z \ge \frac{1}{n} \Big) = 0, \]

that is, $P(Z = 0) = 1$.


Inner Product and Schwarz’s Inequality 


Definition 3.3.4 Let $H$ be a vector space with scalar field $K = \mathbb{C}$ or $\mathbb{R}$, endowed with a mapping from $H \times H$ to $K$ associating to the pair $(x, y)$ of vectors of $H$ the scalar $\langle x, y \rangle$, and such that for all $x, y, z \in H$ and all $\lambda \in K$,

1. $\langle y, x \rangle = \langle x, y \rangle^*$,
2. $\langle \lambda y, x \rangle = \lambda \langle y, x \rangle$,
3. $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$.

The map $(x, y) \mapsto \langle x, y \rangle$ is called an inner product, and $\langle x, y \rangle$ is called the inner product of $x$ and $y$. The vector space $H$, when endowed with such an inner product, is called a pre-Hilbert space.


If we take

\[ \langle X, Y \rangle := E[XY^*] \]

for inner product of $L^2_{\mathbb{C}}(P)$, the three conditions above are obviously satisfied.

Two vectors $x$ and $y$ in $H$ are called orthogonal if $\langle x, y \rangle = 0$.

For any $x \in H$, let

\[ \|x\|^2 := \langle x, x \rangle. \]


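Theorem 3.3.5 For all $x, y \in H$,

\[ |\langle x, y \rangle| \le \|x\| \times \|y\|, \]

with equality if and only if $x$ and $y$ are colinear.

Proof. Say $K = \mathbb{C}$. If $x$ and $y$ are colinear, that is $x = \lambda y$ for some $\lambda \in \mathbb{C}$, the inequality is obviously an equality. If $x$ and $y$ are linearly independent, then for all $\lambda \in \mathbb{C}$, $x + \lambda y \ne 0$. Therefore

\[ 0 < \|x + \lambda y\|^2 = \|x\|^2 + |\lambda|^2 \|y\|^2 + \lambda^* \langle x, y \rangle + \lambda \langle x, y \rangle^* = \|x\|^2 + |\lambda|^2 \|y\|^2 + 2\,\mathrm{Re}(\lambda^* \langle x, y \rangle). \]

Take $u \in \mathbb{C}$, $|u| = 1$, such that $u^* \langle x, y \rangle = |\langle x, y \rangle|$. For $t \in \mathbb{R}$, let $\lambda := tu$. Then

\[ 0 < \|x\|^2 + t^2 \|y\|^2 + 2t |\langle x, y \rangle|. \]

This is true for all $t \in \mathbb{R}$. Therefore the discriminant of this second degree polynomial in $t$ must be strictly negative, that is, $4 |\langle x, y \rangle|^2 - 4 \|x\|^2 \times \|y\|^2 < 0$.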


The Correlation Coefficient 


Schwarz's inequality for square-integrable random variables reads:

\[ |E[XY]| \le E[|XY|] \le E[|X|^2]^{\frac{1}{2}} \times E[|Y|^2]^{\frac{1}{2}}. \qquad (3.43) \]

In particular, with $Y = 1$,

\[ E[|X|] \le E[|X|^2]^{\frac{1}{2}} < \infty. \qquad (3.44) \]

Definition 3.3.6 Two complex square-integrable random variables $X$ and $Y$ are said to be orthogonal if $E[XY^*] = 0$. They are said to be uncorrelated if $E[(X - m_X)(Y - m_Y)^*] = 0$.


Definition 3.3.7 The covariance of the two complex square-integrable variables $X$ and $Y$ is, by definition, the complex number $E[(X - m_X)(Y - m_Y)^*]$. It will be denoted by $\sigma_{XY}$.


Definition 3.3.8 Let $X$ and $Y$ be square-integrable real random variables with respective means $m_X$ and $m_Y$, and respective variances $\sigma_X^2 > 0$ and $\sigma_Y^2 > 0$. Their correlation coefficient is the quantity

\[ \rho_{XY} := \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \]

where $\sigma_{XY}$ is the covariance.

By Schwarz's inequality, $|\sigma_{XY}| \le \sigma_X \sigma_Y$, and therefore

\[ |\rho_{XY}| \le 1, \]

with equality if and only if the centered variables $X - m_X$ and $Y - m_Y$ are colinear. Recall that when $\rho_{XY} = 0$, $X$ and $Y$ are said to be uncorrelated. If $\rho_{XY} > 0$, they are said to be positively correlated, whereas if $\rho_{XY} < 0$, they are said to be negatively correlated.


The next result provides an interesting interpretation of the correlation coefficient.

Theorem 3.3.9 Let $X$ and $Y$ be square-integrable real random variables with positive variances. Among all variables $Z = aX + b$, where $a$ and $b$ are real numbers, the one that minimizes the error $E[(Z - Y)^2]$ is

\[ \hat{Y} = m_Y + \frac{\sigma_{XY}}{\sigma_X^2} (X - m_X) \]

and the error is then

\[ E[(\hat{Y} - Y)^2] = \sigma_Y^2 (1 - \rho_{XY}^2). \]


This is a particular case of the forthcoming Theorem 3.3.16. 


We see that if the variables are not correlated, then the best prediction is the trivial one $\hat{Y} = m_Y$ and the (maximal) error is then $\sigma_Y^2$. In imprecise but suggestive terms, high correlation implies high predictability.
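To make the interpretation concrete, the following sketch computes the best affine predictor $m_Y + (\sigma_{XY}/\sigma_X^2)(X - m_X)$ and the error $\sigma_Y^2(1 - \rho_{XY}^2)$ for the uniform distribution on a small finite list of pairs. The data and the function names are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)

def best_affine_predictor(xs, ys):
    """Return (slope, intercept, error) for the empirical law of (xs, ys)."""
    m_x, m_y = mean(xs), mean(ys)
    var_x = mean([(x - m_x) ** 2 for x in xs])
    var_y = mean([(y - m_y) ** 2 for y in ys])
    cov_xy = mean([(x - m_x) * (y - m_y) for x, y in zip(xs, ys)])
    rho = cov_xy / (var_x * var_y) ** 0.5
    slope = cov_xy / var_x
    intercept = m_y - slope * m_x
    error = var_y * (1.0 - rho ** 2)
    return slope, intercept, error

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
slope, intercept, error = best_affine_predictor(xs, ys)
# Mean squared residual computed directly, for comparison with `error`.
direct_error = mean([(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)])
```

Here $\rho_{XY} = 0.6$, so the predicted error $\sigma_Y^2(1 - \rho_{XY}^2)$ and the directly computed mean squared residual should coincide.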
Covariance Matrices


Recall the notation in use in this book for vectors and matrices: an asterisk superscript ($^*$) denotes complex conjugates, a T superscript ($^T$) is for vector transposition, and the dagger superscript ($^\dagger$) is for conjugation-transposition. When $x$ is a vector of $\mathbb{R}^n$, we shall always assume in the notation that it is a column vector, and therefore $x^T$ will be the corresponding row vector.


Definition 3.3.10 A random vector $X = (X_1, \ldots, X_n)^T$ such that $X_1, \ldots, X_n$ are square-integrable complex random variables is called a square-integrable complex vector.

In particular, by (3.44),

\[ E[|X_i|] < \infty \quad (1 \le i \le n) \]

and by Schwarz's inequality (3.43),

\[ E[|X_i X_j|] < \infty \quad (1 \le i, j \le n). \]

Therefore, the mean

\[ m_X := E[X] = (E[X_1], \ldots, E[X_n])^T \]

and the covariance matrix of $X$

\[ \Gamma_X := E[(X - m_X)(X - m_X)^\dagger] = \{ E[(X_i - m_{X_i})(X_j - m_{X_j})^*] \}_{1 \le i, j \le n} = \{ \mathrm{cov}(X_i, X_j) \}_{1 \le i, j \le n} \]

are well defined.


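Theorem 3.3.11 The matrix $\Gamma_X$ is Hermitian symmetric, that is,

\[ \Gamma_X^\dagger = \Gamma_X, \qquad (3.45) \]

and it is non-negative definite, that is,

\[ a^\dagger \Gamma_X a \ge 0, \qquad (3.46) \]

for all $a \in \mathbb{C}^n$. This is denoted by $\Gamma_X \ge 0$.

Proof.

\[ a^\dagger \Gamma_X a = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i^* a_j E[(X_i - E[X_i])(X_j - E[X_j])^*] \]

\[ = E\Big[ \sum_{i=1}^{n} \sum_{j=1}^{n} a_i^* a_j (X_i - E[X_i])(X_j - E[X_j])^* \Big] \]

\[ = E\Big[ \Big( \sum_{i=1}^{n} a_i^* (X_i - E[X_i]) \Big) \Big( \sum_{j=1}^{n} a_j^* (X_j - E[X_j]) \Big)^* \Big] \]

\[ = E[|a^\dagger (X - E[X])|^2] \ge 0. \]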


Theorem 3.3.12 Let $X$ be a square-integrable real random vector of dimension $n \ge 2$ with a covariance matrix $\Gamma_X$ which is degenerate, that is,

\[ a^T \Gamma_X a = 0, \]

for some $a \in \mathbb{R}^n$, $a \ne 0$. Then, $X$ lies almost surely in a hyperplane of $\mathbb{R}^n$ of dimension strictly less than $n$, and cannot have a probability density.

Proof. For such $a$, $E[|a^T (X - E[X])|^2] = a^T \Gamma_X a = 0$, and therefore, by Lemma 3.3.3,

\[ a^T (X - E[X]) = 0, \]

almost surely. Suppose the existence of such a probability density $f$. Then, denoting by $\Pi$ the hyperplane in question,

\[ P(X \in \Pi) = \int_\Pi f(x)\,dx, \]

a null quantity since the $n$-volume of a hyperplane of $\mathbb{R}^n$ is null. This contradicts $P(X \in \Pi) = 1$.


If $\Gamma_X$ is non-degenerate, we write $\Gamma_X > 0$. A vector $X$ with degenerate covariance matrix is also called degenerate.

We now examine the effects of an affine transformation of a random vector on its covariance matrix. Let $X$ be a square-integrable $n$-dimensional complex random vector, with mean $m_X$ and covariance matrix $\Gamma_X$. Let $A$ be a $(k \times n)$-dimensional complex matrix (so that the product $AX$ is a $k$-dimensional vector), and $b$ a $k$-dimensional real vector.

Theorem 3.3.13 The $k$-dimensional complex vector $Z = AX + b$ has mean

\[ m_Z = A m_X + b \]

and covariance matrix

\[ \Gamma_Z = A \Gamma_X A^\dagger. \]

Proof. The formula giving the mean is immediate. As for the other one, it suffices to observe that $(Z - m_Z) = A(X - m_X)$ and to write

\[ \Gamma_Z = E[(Z - m_Z)(Z - m_Z)^\dagger] = E[A(X - m_X)(X - m_X)^\dagger A^\dagger] = A E[(X - m_X)(X - m_X)^\dagger] A^\dagger = A \Gamma_X A^\dagger. \]
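The transformation rule $\Gamma_Z = A \Gamma_X A^\dagger$ can be checked numerically in the real case (where $A^\dagger = A^T$) on the uniform distribution over a finite set of points, for which the identity holds exactly. The sketch below is a plain-Python illustration; the points, the matrix $A$, the vector $b$ and the helper names are all assumptions made for the example.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def empirical_covariance(points):
    """Covariance matrix of the uniform law on a finite list of real vectors."""
    count, dim = len(points), len(points[0])
    m = [sum(p[i] for p in points) / count for i in range(dim)]
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / count
             for j in range(dim)] for i in range(dim)]

def affine_map(a, b, x):
    """Compute a x + b for a matrix a and vectors b, x."""
    return [sum(a[i][k] * x[k] for k in range(len(x))) + b[i]
            for i in range(len(a))]

points = [[1.0, 2.0], [3.0, 1.0], [2.0, 4.0], [0.0, 1.0]]
A = [[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]]   # 3 x 2, so Z = AX + b lives in R^3
b = [1.0, 0.0, -1.0]

gamma_x = empirical_covariance(points)
gamma_z = empirical_covariance([affine_map(A, b, p) for p in points])
predicted = mat_mul(mat_mul(A, gamma_x), transpose(A))
```

The matrices `gamma_z` and `predicted` should agree up to rounding; note that the shift $b$ has no effect on the covariance.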


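Let $X$ and $Y$ be square-integrable complex random vectors of respective dimensions $n$ and $q$. We define the covariance matrix of $X$ and $Y$ (in this order) by

\[ \Gamma_{XY} := E[(X - m_X)(Y - m_Y)^\dagger]. \]

Note that

\[ \Gamma_{YX} = \Gamma_{XY}^\dagger. \]

Also, for the $(n + q)$-dimensional vector

\[ Z := (X_1, \ldots, X_n, Y_1, \ldots, Y_q)^T \]

the covariance matrix takes the block form

\[ \Gamma_Z = \begin{pmatrix} \Gamma_X & \Gamma_{XY} \\ \Gamma_{YX} & \Gamma_Y \end{pmatrix}. \]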


Linear Regression 


Let $Y, X_1, \ldots, X_N$ be square-integrable real random variables. We now consider the problem of the best linear-quadratic approximation of $Y$ based on $X_1, \ldots, X_N$. More precisely, we seek a real vector $a = (a_1, \ldots, a_N)^T$ such that the linear combination $\hat{Y} := a^T X = \sum_{i=1}^{N} a_i X_i$ satisfies

\[ E[|Y - \hat{Y}|^2] \le E[|Y - Z|^2] \]

for every linear combination $Z = \sum_{i=1}^{N} b_i X_i$, where $b = (b_1, \ldots, b_N)^T$ is a real vector.

Definition 3.3.14 The random variable $\hat{Y}$ achieving the minimum is called the linear regression of $Y$ on $X_1, \ldots, X_N$, or, again, the best linear-quadratic approximation of $Y$ as a function of $X_1, \ldots, X_N$. The vector $a$ is called the regression vector.

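Letting

\[ F(b) := E[|Y - Z|^2] = E\Big[ \Big( Y - \sum_{i=1}^{N} b_i X_i \Big)^2 \Big], \]

we have

\[ \frac{\partial F}{\partial b_i}(b) = -2E\Big[ \Big( Y - \sum_{j=1}^{N} b_j X_j \Big) X_i \Big] = -2E[(Y - Z) X_i]. \]

On writing $\partial F / \partial b_i = 0$ ($1 \le i \le N$), we see that a vector $a$ realizing an extremum of $F$ and the corresponding approximation $\hat{Y} = a^T X$ satisfy the system

\[ E[(Y - \hat{Y}) X_i] = 0 \quad (1 \le i \le N). \qquad (3.47) \]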


The $N$ preceding equations may be written as a function of the unknowns $a_1, \ldots, a_N$,

\[ \sum_{j=1}^{N} a_j E[X_j X_i] = E[Y X_i] \quad (1 \le i \le N), \qquad (3.48) \]

or, in matrix form:

\[ \begin{pmatrix} E[X_1 X_1] & E[X_1 X_2] & \cdots & E[X_1 X_N] \\ E[X_2 X_1] & E[X_2 X_2] & \cdots & E[X_2 X_N] \\ \vdots & \vdots & & \vdots \\ E[X_N X_1] & E[X_N X_2] & \cdots & E[X_N X_N] \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix} = \begin{pmatrix} E[Y X_1] \\ E[Y X_2] \\ \vdots \\ E[Y X_N] \end{pmatrix}. \]

More compactly,

\[ \Gamma_X a = \Gamma_{XY}. \qquad (3.49) \]

In view of (3.47), we have $E[(Y - \hat{Y}) \hat{Y}] = 0$, and therefore,

\[ E[|Y - \hat{Y}|^2] = E[(Y - \hat{Y}) Y]. \qquad (3.50) \]

The covariance matrix $\Gamma_X$ is non-singular if and only if $X_1, \ldots, X_N$ are linearly independent vectors. In this case (3.48) admits a unique solution. Thus under the condition $\Gamma_X > 0$, we have a unique extremum, which we know to be a minimum because the coefficients of the squares of the quadratic form $F$ are positive (at least if we assume, without loss of generality, that none of the $X_i$ is the null vector).


In summary, 


Theorem 3.3.15 Let $Y, X_1, \ldots, X_N$ be real square-integrable centered random variables. A necessary and sufficient condition for $\hat{Y} = a_1 X_1 + \cdots + a_N X_N$ to be a best quadratic approximation of $Y$ by a linear function of $X_1, \ldots, X_N$ is

\[ E[(Y - \hat{Y}) X_i] = 0 \quad (1 \le i \le N). \]

The regression vector is given by

\[ \Gamma_X a = \Gamma_{XY} \qquad (3.51) \]

and the minimum quadratic error $d^2 = E[|Y - \hat{Y}|^2]$ is given by $d^2 = \langle Y - \hat{Y}, Y \rangle$.


We now assume linear independence (in the algebraic sense) of $X_1, \ldots, X_N$, which is expressed by the condition

\[ \Gamma_X > 0, \qquad (3.52) \]

in which case $\Gamma_X^{-1}$ exists and therefore there exists a unique regression vector $a = \Gamma_X^{-1} \Gamma_{XY}$, so that

\[ \hat{Y} = \Gamma_{YX} \Gamma_X^{-1} X. \qquad (3.53) \]
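For $N = 2$ the system $\Gamma_X a = \Gamma_{XY}$ can be solved by hand. The sketch below does so with Cramer's rule for the uniform distribution on four outcomes and then checks the orthogonality conditions $E[(Y - \hat{Y}) X_i] = 0$. The sample values and function names are illustrative assumptions, not data from the text.

```python
def mean(values):
    return sum(values) / len(values)

def regression_vector_2d(x1, x2, y):
    """Solve the 2 x 2 system Gamma_X a = Gamma_XY by Cramer's rule."""
    g11 = mean([u * u for u in x1])
    g12 = mean([u * v for u, v in zip(x1, x2)])
    g22 = mean([v * v for v in x2])
    c1 = mean([u * w for u, w in zip(x1, y)])
    c2 = mean([v * w for v, w in zip(x2, y)])
    det = g11 * g22 - g12 * g12   # non-zero when X1, X2 are linearly independent
    return (c1 * g22 - g12 * c2) / det, (g11 * c2 - g12 * c1) / det

# Centered sample values, each outcome having probability 1/4.
x1 = [-1.0, 1.0, -1.0, 1.0]
x2 = [-1.0, -1.0, 1.0, 1.0]
y = [-2.0, 1.0, 0.0, 1.0]
a1, a2 = regression_vector_2d(x1, x2, y)
residual = [w - (a1 * u + a2 * v) for u, v, w in zip(x1, x2, y)]
orth1 = mean([r * u for r, u in zip(residual, x1)])
orth2 = mean([r * v for r, v in zip(residual, x2)])
```

For this data the regression vector is $(1, 0.5)$, and both `orth1` and `orth2` vanish, as the orthogonality conditions require.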


We now consider the case where the random variables $Y$ and $X_1, \ldots, X_N$ are no longer assumed to be centered (but we keep the condition (3.52)). The problem is now to find the affine combination of $X_1, \ldots, X_N$ which best approximates $Y$ in the least-squares sense. In other words, we seek to minimize

\[ E[(Y - b_0 - b_1 X_1 - \cdots - b_N X_N)^2] \]

with respect to the scalars $b_0, \ldots, b_N$. This problem can be reduced to the preceding one as follows. In fact, for every square-integrable random variable $U$ with mean $m$,

\[ E[(U - c)^2] \ge E[(U - m)^2] \quad \text{for all } c. \]

(Exercise 3.1.11.) Therefore

\[ E[(Y - b^T X - b_0)^2] \ge E[(Y - b^T X - E[Y - b^T X])^2], \]

where

\[ b = (b_1, \ldots, b_N)^T. \]

This shows that $b_0$ is necessarily of the form $b_0 = m_Y - b^T m_X$. Therefore we have reduced the original problem to that of minimizing with respect to $b$ the quantity $E[((Y - m_Y) - b^T (X - m_X))^2]$, and for this we can use the result obtained in the case of centered random variables.


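Theorem 3.3.16 If $X$ is nondegenerate, the best linear-quadratic approximation of $Y$ as an affine function of $X$ is

\[ \hat{Y} = m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X). \qquad (3.54) \]

The minimum quadratic error is then given by

\[ E[(\hat{Y} - Y)^2] = \sigma_Y^2 - \Gamma_{YX} \Gamma_X^{-1} \Gamma_{XY}. \qquad (3.55) \]

Proof. It remains to prove (3.55). From (3.54), we have

\[ E[(\hat{Y} - Y)^2] = E[(Y - m_Y - \Gamma_{YX} \Gamma_X^{-1} (X - m_X))^2] \]

\[ = E[(Y - m_Y)^2] - 2E[(\Gamma_{YX} \Gamma_X^{-1} (X - m_X))(Y - m_Y)] + E[(\Gamma_{YX} \Gamma_X^{-1} (X - m_X))^2]. \]

But

\[ E[(\Gamma_{YX} \Gamma_X^{-1} (X - m_X))(Y - m_Y)] = \Gamma_{YX} \Gamma_X^{-1} E[(X - m_X)(Y - m_Y)] = \Gamma_{YX} \Gamma_X^{-1} \Gamma_{XY} \]

and

\[ E[(\Gamma_{YX} \Gamma_X^{-1} (X - m_X))^2] = \Gamma_{YX} \Gamma_X^{-1} E[(X - m_X)(X - m_X)^T] \Gamma_X^{-1} \Gamma_{XY} = \Gamma_{YX} \Gamma_X^{-1} \Gamma_X \Gamma_X^{-1} \Gamma_{XY} = \Gamma_{YX} \Gamma_X^{-1} \Gamma_{XY}. \]

When $X$ is centered, we shall denote $P(Y|X)$ by $\hat{Y}$.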


3.4 Gaussian Vectors 


The importance of Gaussian vectors is due to their mathematical tractability, their 
stability with respect to linear transformations, and the fact that their distribution 
is entirely characterized by their mean vector and their covariance matrix. 


We begin by slightly extending the definition of a Gaussian random variable. This extension will be useful in the definition of a Gaussian vector:

Definition 3.4.1 An extended Gaussian variable $X$ is any real random variable with a characteristic function of the form

\[ \varphi_X(u) = \exp\{ imu - \tfrac{1}{2} \sigma^2 u^2 \}, \qquad (3.56) \]

where $m \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}_+$.

The only difference with the standard definition is that a null variance $\sigma^2$ is allowed, in which case the random variable is (almost surely) a constant.

Definition 3.4.2 A standard Gaussian variable is a Gaussian variable with mean 0 and variance 1: $X \sim \mathcal{N}(0, 1)$.

Definition 3.4.3 An $n$-dimensional real random vector $X$ is called a Gaussian random vector if the random variable $a^T X$ is an extended Gaussian random variable for all $a \in \mathbb{R}^n$.

Definition 3.4.4 A standard Gaussian vector is a Gaussian vector with mean vector 0 and covariance matrix $I$ (the identity matrix): $X \sim \mathcal{N}(0, I)$.


The next result is an immediate consequence of the above definition and of Theorem 3.3.13.

Theorem 3.4.5 Let $X$ be an $n$-dimensional Gaussian vector with mean vector $m_X$ and covariance matrix $\Gamma_X$. Let $A$ be a $(k \times n)$-dimensional real matrix, and $b$ a $k$-dimensional real vector. The $k$-dimensional vector $Z = AX + b$ is then a Gaussian vector with mean vector

\[ m_Z = A m_X + b \]

and covariance matrix

\[ \Gamma_Z = A \Gamma_X A^T. \]


We now make the connection with the classical definition of Gaussian vectors 
in terms of characteristic functions. 


Theorem 3.4.6 For a real $n$-dimensional random vector $X$ to be a Gaussian vector it is necessary and sufficient that its characteristic function $\varphi_X$ be of the following form:

\[ \varphi_X(u) = \exp\{ iu^T m_X - \tfrac{1}{2} u^T \Gamma_X u \}, \qquad (3.57) \]

where $m_X \in \mathbb{R}^n$ and where $\Gamma_X$ is a symmetric and non-negative definite $n \times n$ matrix. In this case the parameters $m_X$ and $\Gamma_X$ are respectively the mean vector and the covariance matrix of $X$.

Proof. Necessary condition. The characteristic function of a Gaussian vector as defined in Definition 3.4.3 is

\[ E[e^{iu^T X}] = \varphi_Z(1), \]

where $\varphi_Z$ is the CF of $Z := u^T X$. The random variable $Z$ being an extended Gaussian variable,

\[ \varphi_Z(1) = \exp\{ im_Z - \tfrac{1}{2} \sigma_Z^2 \}, \]

where

\[ m_Z = E[u^T X] = u^T m_X \]

and

\[ \sigma_Z^2 := E[(u^T (X - m_X))(u^T (X - m_X))^T] = u^T E[(X - m_X)(X - m_X)^T] u = u^T \Gamma_X u. \]

Therefore, finally,

\[ \varphi_X(u) = \exp\{ iu^T m_X - \tfrac{1}{2} u^T \Gamma_X u \}. \]

Sufficient condition. Let $X$ be a random vector with characteristic function given by (3.57). Let $Z = a^T X$, where $a \in \mathbb{R}^n$. The characteristic function of the random variable $Z$ is

\[ \varphi_Z(v) = E[\exp\{ ivZ \}] = E[\exp\{ iv a^T X \}] = \exp\{ iv (a^T m_X) - \tfrac{1}{2} v^2 (a^T \Gamma_X a) \}. \]

Therefore $Z$ is an extended Gaussian random variable.


Mixed Moments of Gaussian Vectors 


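We shall give two useful formulas concerning the moments of a centered (0-mean) $n$-dimensional Gaussian vector $X = (X_1, \ldots, X_n)^T$ with the covariance matrix $\Gamma = \{\sigma_{ij}\}$.

First, we have

\[ E[X_{i_1} X_{i_2} \cdots X_{i_{2k}}] = \sum_{\substack{(j_1, \ldots, j_{2k}) \\ j_1 < j_2, \ldots, j_{2k-1} < j_{2k}}} \sigma_{j_1 j_2} \sigma_{j_3 j_4} \cdots \sigma_{j_{2k-1} j_{2k}}, \qquad (3.58) \]

where the summation extends over all permutations $(j_1, \ldots, j_{2k})$ of $\{i_1, \ldots, i_{2k}\}$ such that $j_1 < j_2, \ldots, j_{2k-1} < j_{2k}$, each partition of the $2k$ factors into pairs being counted exactly once. There are $1 \cdot 3 \cdot 5 \cdots (2k - 1)$ terms in the right-hand side of Eq. (3.58). The indices $i_1, \ldots, i_{2k}$ are in $\{1, \ldots, n\}$ and they may occur with repetitions. For instance

\[ E[X_1 X_2 X_3 X_4] = \sigma_{12} \sigma_{34} + \sigma_{13} \sigma_{24} + \sigma_{14} \sigma_{23}, \]

\[ E[X_1^2 X_2^2] = \sigma_{11} \sigma_{22} + \sigma_{12} \sigma_{12} + \sigma_{12} \sigma_{12} = \sigma_{11} \sigma_{22} + 2 \sigma_{12}^2, \]

\[ E[X_1^4] = 3 \sigma_{11}^2 = 3 \sigma_1^4, \]

\[ E[X_1^{2k}] = 1 \cdot 3 \cdots (2k - 1) \, \sigma_1^{2k}. \]

Also the odd moments of a centered Gaussian vector are null, that is:

\[ E[X_{i_1} \cdots X_{i_{2k+1}}] = 0, \qquad (3.59) \]

for all $(i_1, \ldots, i_{2k+1}) \in \{1, 2, \ldots, n\}^{2k+1}$.

The proof of the formulas above is required in Exercise 5.7.11.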
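The sum over pairings in the mixed-moment formula can be enumerated directly. The sketch below lists every partition of the factors into pairs and adds up the corresponding products of covariances; it is an illustration under the assumption that the covariance matrix is given as a list of rows with 0-based indices, and the function names and the matrix `sigma` are hypothetical.

```python
def pairings(items):
    """Yield every partition of the list `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k in range(len(rest)):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, rest[k])] + tail

def gaussian_mixed_moment(indices, sigma):
    """Sum over pairings of products of covariances sigma[i][j] (0-based)."""
    if len(indices) % 2 == 1:
        return 0.0   # odd moments of a centered Gaussian vector vanish
    total = 0.0
    for pairing in pairings(list(indices)):
        term = 1.0
        for i, j in pairing:
            term *= sigma[i][j]
        total += term
    return total

sigma = [[2.0, 0.5], [0.5, 1.0]]   # illustrative covariance matrix
```

For six factors the generator produces $1 \cdot 3 \cdot 5 = 15$ pairings, and `gaussian_mixed_moment([0, 0, 1, 1], sigma)` evaluates $\sigma_{11} \sigma_{22} + 2 \sigma_{12}^2$ for the chosen matrix.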


Independence and Non-Correlation 


In general, non-correlation does not imply independence. However, this is nearly 
(see Example 3.4.9 below) true in the case of Gaussian vectors. We start with a 
definition in view of correctly stating the announced result. 


Definition 3.4.7 Two random real vectors $X$ and $Y$ of respective dimensions $n$ and $q$ are said to be jointly Gaussian if the vector $Z$ defined by

\[ Z := (X_1, \ldots, X_n, Y_1, \ldots, Y_q)^T \]

is a Gaussian vector.

Theorem 3.4.8 Two jointly Gaussian random vectors $X$ and $Y$ of respective dimensions $n$ and $q$ are independent if and only if they are uncorrelated (that is $\Gamma_{XY} = 0$).


Proof. Necessity: If $X$ and $Y$ are independent then, by the product formula for expectations,

\[ E[(X - m_X)(Y - m_Y)^T] = E[X - m_X] E[Y - m_Y]^T = 0. \]

Sufficiency: If $X$ and $Y$ are uncorrelated the vector $Z$ has for covariance matrix

\[ \Gamma_Z = \begin{pmatrix} \Gamma_X & 0 \\ 0 & \Gamma_Y \end{pmatrix} \]

and the mean

\[ m_Z = \begin{pmatrix} m_X \\ m_Y \end{pmatrix}. \]

It is a Gaussian vector by hypothesis and therefore, with

\[ w := (u_1, \ldots, u_n, v_1, \ldots, v_q)^T, \]

we have that

\[ E[\exp\{ i(u^T X + v^T Y) \}] = E[\exp\{ iw^T Z \}] = \exp\{ iw^T m_Z - \tfrac{1}{2} w^T \Gamma_Z w \} \]

\[ = \exp\{ i(u^T m_X + v^T m_Y) - \tfrac{1}{2} u^T \Gamma_X u - \tfrac{1}{2} v^T \Gamma_Y v \} = E[\exp\{ iu^T X \}] \, E[\exp\{ iv^T Y \}], \]

and the conclusion follows from the factorization theorem of characteristic functions (Theorem 3.2.20).


EXAMPLE 3.4.9: GAUSSIAN, UNCORRELATED, NOT JOINTLY GAUSSIAN. Let $X$ and $U$ be two independent random variables, where $X \sim \mathcal{N}(0, 1)$ and $U \in \{-1, 1\}$, $P(U = \pm 1) = \frac{1}{2}$. We show that

\[ Y := UX \sim \mathcal{N}(0, 1) \]

and therefore $X$ and $Y$ are separately Gaussian. However, we also show that they are not jointly Gaussian, and that they are uncorrelated, and yet, not independent. The proof of the above statements is as follows:

\[ P(Y \le x) = P(UX \le x) = P(U = 1, X \le x) + P(U = -1, X \ge -x) \]

\[ = P(U = 1) P(X \le x) + P(U = -1) P(X \ge -x) = \frac{1}{2} P(X \le x) + \frac{1}{2} P(X \ge -x) = P(X \le x). \]

Also, $E[YX] = E[UX^2] = E[U] E[X^2] = 0$, that is $X$ and $Y$ are uncorrelated. We show that they are not independent. We have $P(X^2 = Y^2) = 1$. If $X$ and $Y$ were independent, since they are absolutely continuous, $(X, Y)$ would admit a probability density, say, $f_{X,Y}(x, y)$. Then

\[ P(X^2 = Y^2) = \int_{\mathbb{R}} \int_{\mathbb{R}} 1_{\{x^2 = y^2\}} f_{X,Y}(x, y)\,dx\,dy = 0, \]

since the set $\{(x, y) ; x^2 = y^2\}$ has a null area. Hence a contradiction.

The reason why Theorem 3.4.8 cannot be applied is that $(X, Y)$ is not a Gaussian vector. If it were, then $X - Y$ would be an extended Gaussian random variable. Obviously $X - Y$ is not a constant. The only case remaining is that in which $X - Y$ has a probability density, and therefore $P(X - Y = 0) = 0$. But this is incompatible with $P(X - Y = 0) = P(U = 1) = \frac{1}{2}$.
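A short simulation illustrates the three claims of this example: the empirical covariance of $X$ and $Y = UX$ is close to 0, the squares of $X$ and $Y$ always coincide, and $X - Y$ vanishes about half of the time. The sample size, the seed and the variable names are illustrative choices.

```python
import random

rng = random.Random(1)
n_samples = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
signs = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n_samples)]
ys = [u * x for u, x in zip(signs, xs)]

# Empirical E[XY]; both variables are centered, so this estimates the covariance.
empirical_cov = sum(x * y for x, y in zip(xs, ys)) / n_samples
# X^2 = Y^2 on every draw, which rules out independence.
squares_agree = all(x * x == y * y for x, y in zip(xs, ys))
# X - Y = 0 exactly when U = +1, so this fraction should be close to 1/2.
fraction_equal = sum(1 for x, y in zip(xs, ys) if x == y) / n_samples
```

An atom of mass $\frac{1}{2}$ at 0 for $X - Y$ is what prevents $(X, Y)$ from being a Gaussian vector.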


Probability Density of a Non-degenerate Gaussian Vector 


A Gaussian vector with a degenerate covariance matrix cannot have a probability 
density (Theorem 3.3.12). However: 


Theorem 3.4.10 Let $X$ be an $n$-dimensional Gaussian vector with mean vector $m$ and non-degenerate covariance matrix $\Gamma$ (in particular, if $u^T \Gamma u = 0$, then $u = 0$). Then $X$ admits the probability density function

\[ f_X(x) = \frac{1}{(2\pi)^{n/2} (\det \Gamma)^{1/2}} \exp\{ -\tfrac{1}{2} (x - m)^T \Gamma^{-1} (x - m) \}. \qquad (3.60) \]

Proof. Since $\Gamma > 0$, there exists a non-singular matrix $A$ of the same dimension such that $\Gamma = AA^T$. Let $Z := A^{-1}(X - m)$. By Definition 3.4.3, it is a Gaussian vector with mean 0 and covariance matrix

\[ \Gamma_Z = A^{-1} \Gamma A^{-T} = A^{-1} A A^T A^{-T} = I. \]

Therefore its characteristic function is

\[ E[\exp\{ iu^T Z \}] = \exp\Big\{ -\frac{1}{2} \sum_{i=1}^{n} u_i^2 \Big\}. \]

This is the characteristic function of a centered Gaussian vector having independent coordinates, and therefore $Z_1, \ldots, Z_n$ are independent standard Gaussian random variables. In particular, the probability density of $Z$ has the form of a product:

\[ f_Z(z) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = \frac{1}{(2\pi)^{n/2}} \exp\{ -\tfrac{1}{2} \|z\|^2 \}. \]

Now, $X = AZ + m$ and therefore, by the formula for a smooth change of variables,

\[ f_X(x) = f_Z(A^{-1}(x - m)) \, |\det A^{-1}| = \frac{1}{(2\pi)^{n/2}} \frac{1}{|\det A|} \exp\{ -\tfrac{1}{2} \|A^{-1}(x - m)\|^2 \}, \]

and this is precisely (3.60) since $\det \Gamma = (\det A)^2$ and

\[ \|A^{-1}(x - m)\|^2 = (A^{-1}(x - m))^T (A^{-1}(x - m)) = (x - m)^T A^{-T} A^{-1} (x - m) = (x - m)^T \Gamma^{-1} (x - m). \]
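The change of variables used in this proof can be checked numerically for $n = 2$. The sketch below evaluates the density once from the closed form with $\Gamma^{-1}$ and $\det \Gamma$, and once through a factorization $\Gamma = AA^T$; the lower triangular (Cholesky) factor is one possible choice of $A$, picked here for convenience. The numerical values and function names are assumptions made for the illustration.

```python
import math

def density_direct(x, m, gamma):
    """Closed-form Gaussian density for n = 2, with the 2 x 2 inverse written out."""
    (a, b), (_, c) = gamma
    det = a * c - b * b
    d0, d1 = x[0] - m[0], x[1] - m[1]
    quad = (c * d0 * d0 - 2.0 * b * d0 * d1 + a * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def density_via_factor(x, m, gamma):
    """Same density through Gamma = A A^T with A lower triangular (Cholesky)."""
    (a, b), (_, c) = gamma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    # Solve A z = x - m by forward substitution, i.e. z = A^{-1}(x - m).
    z0 = (x[0] - m[0]) / l11
    z1 = (x[1] - m[1] - l21 * z0) / l22
    return math.exp(-0.5 * (z0 * z0 + z1 * z1)) / (2.0 * math.pi * l11 * l22)

gamma = [[2.0, 0.6], [0.6, 1.0]]
m = [1.0, -1.0]
x = [0.5, 0.2]
```

Both functions should return the same value up to rounding, since $|\det A| = (\det \Gamma)^{1/2}$ and $\|A^{-1}(x - m)\|^2 = (x - m)^T \Gamma^{-1} (x - m)$.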


Empirical Mean and Variance of the Gaussian Distribution 


A Gaussian sample of size $n$ is, by definition, a random vector $X = (X_1, \ldots, X_n)$ of IID $\mathcal{N}(m, \sigma^2)$ Gaussian variables. Any random variable of the form $f(X_1, \ldots, X_n)$ is called a statistic of this sample. The two main statistics are the empirical mean

\[ \bar{X} := \frac{X_1 + \cdots + X_n}{n} \]

and the empirical variance

\[ S^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

The perhaps surprising factor $\frac{1}{n-1}$ (instead of $\frac{1}{n}$) is motivated by the result of Exercise 3.6.36.

Theorem 3.4.11 The empirical mean and the empirical variance of the above Gaussian sample are independent and $[(n - 1)/\sigma^2] S^2$ has a chi-square distribution with $n - 1$ degrees of freedom.


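Proof. We first treat the case where $m = 0$ and $\sigma^2 = 1$. For this, we rely on the next lemma (Cochran's lemma).

Recall that a unitary square complex matrix is one for which the conjugate transpose is its inverse.

Lemma 3.4.12 There exists an $n \times n$ real unitary matrix $C$ such that if the $n$-vectors $x$ and $y$ are related by $y = Cx$, then (with the obvious notation)

\[ y_n = \sqrt{n} \, \bar{x} \quad \text{and} \quad y_1^2 + \cdots + y_{n-1}^2 = (n - 1) s^2, \qquad (3.61) \]

where

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (3.62) \]

The random vector $Y = CX$ is a Gaussian vector, and a standard one since

\[ \Gamma_Y = C \Gamma_X C^T = C I C^T = C C^T = I \]

(the transpose of a real unitary matrix is its inverse). According to (3.61) and (3.62),

\[ \bar{X} = \frac{Y_n}{\sqrt{n}} \quad \text{and} \quad S^2 = \frac{1}{n-1} (Y_1^2 + \cdots + Y_{n-1}^2). \]

The independence of $\bar{X}$ and $S^2$ then follows from the independence of $Y_n$ and $(Y_1, \ldots, Y_{n-1})$. Moreover, $(n - 1) S^2 = Y_1^2 + \cdots + Y_{n-1}^2$ is the sum of the squares of $n - 1$ independent standard Gaussian variables, and therefore has a chi-square distribution with $n - 1$ degrees of freedom.

For the general case, apply the above result to the variables $X_i' := \frac{X_i - m}{\sigma}$ ($1 \le i \le n$) and observe that $\bar{X'} = \frac{\bar{X} - m}{\sigma}$ and $(S')^2 = \frac{S^2}{\sigma^2}$.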

3.5 Conditional Expectation II 


The difference with the discrete case is that for all y, P(Y = y) = 0, and this calls 
for a new definition, that of conditional probability density. Otherwise, this case 
is completely similar to the discrete case, with integrals replacing sums. 


Definition 3.5.1 Let $X$ and $Y$ be random vectors of dimensions $p$ and $n$ respectively, with joint probability density $f_{X,Y}$, and let $f_Y$ be the probability density function of $Y$. Let $y \in \mathbb{R}^n$ be fixed. The function $f_X^{Y=y} : \mathbb{R}^p \to \mathbb{R}$ defined by

\[ f_X^{Y=y}(x) := \frac{f_{X,Y}(x, y)}{f_Y(y)}, \]

with the convention $f_X^{Y=y}(x) = 0$ (or any other arbitrary value) when $f_Y(y) = 0$, is called the conditional probability density of $X$ given $Y = y$.

Note that when $f_Y(y) > 0$, $f_{X,Y}(x, y) = f_X^{Y=y}(x) f_Y(y)$.


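EXAMPLE 3.5.2: CORRELATED GAUSSIAN VARIABLES, TAKE 1. Let $X_1$ and $X_2$ be two random variables with the joint probability density

\[ f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\Big\{ -\frac{1}{2(1 - \rho^2)} \Big( \frac{x_1^2}{\sigma_1^2} - 2\rho \frac{x_1 x_2}{\sigma_1 \sigma_2} + \frac{x_2^2}{\sigma_2^2} \Big) \Big\}. \]

The random variable $X_2$ is Gaussian with mean 0 and variance $\sigma_2^2$, that is

\[ f_{X_2}(x_2) = \frac{1}{\sigma_2 \sqrt{2\pi}} \exp\Big\{ -\frac{1}{2} \frac{x_2^2}{\sigma_2^2} \Big\}. \]

We then find that

\[ f_{X_1}^{X_2 = x_2}(x_1) = \frac{1}{\sigma_1 \sqrt{2\pi (1 - \rho^2)}} \exp\Big\{ -\frac{1}{2 \sigma_1^2 (1 - \rho^2)} \Big( x_1 - \rho \frac{\sigma_1}{\sigma_2} x_2 \Big)^2 \Big\}. \]

Note that this is the probability density (in $x_1$) of a Gaussian random variable with mean $\rho \frac{\sigma_1}{\sigma_2} x_2$ and variance $\sigma_1^2 (1 - \rho^2)$.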


Definition 3.5.3 Let $X$ and $Y$ be two random vectors of dimensions $p$ and $n$ respectively, with joint probability density $f_{X,Y}$, and let $g : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}$ be either non-negative or such that $E[|g(X, Y)|] < \infty$. One defines the function $\psi : \mathbb{R}^n \to \mathbb{R}$ by

\[ \psi(y) := \int_{\mathbb{R}^p} g(x, y) f_X^{Y=y}(x)\,dx \qquad (3.63) \]

on the set $C = \{y \in \mathbb{R}^n ; f_Y(y) > 0\}$, 0 otherwise. For each $y \in \mathbb{R}^n$, $\psi(y)$ is called the conditional expectation of $g(X, Y)$ given $Y = y$, and is denoted by $E^{Y=y}[g(X, Y)]$, or $E[g(X, Y) \mid Y = y]$:

\[ E^{Y=y}[g(X, Y)] = \psi(y). \qquad (3.64) \]

The random variable $\psi(Y)$ is called the conditional expectation of $g(X, Y)$ given $Y$, and is denoted by $E^Y[g(X, Y)]$ or $E[g(X, Y) \mid Y]$.

The integral in (3.63) is well defined (possibly infinite however) when $g$ is non-negative. It remains to check that it is also well defined when $g$ is of arbitrary sign and satisfies the integrability condition $E[|g(X, Y)|] < \infty$. For this we proceed just as in the discrete case. First we note that in the non-negative case, we have that

\[ \int_{\mathbb{R}^n} \psi(y) f_Y(y)\,dy = \int_{\mathbb{R}^n} \Big( \int_{\mathbb{R}^p} g(x, y) f_X^{Y=y}(x)\,dx \Big) f_Y(y)\,dy \]

\[ = \int_{\mathbb{R}^n} \int_{\mathbb{R}^p} g(x, y) f_X^{Y=y}(x) f_Y(y)\,dx\,dy = \int_{\mathbb{R}^n} \int_{\mathbb{R}^p} g(x, y) f_{X,Y}(x, y)\,dx\,dy = E[g(X, Y)]. \]

Therefore, if $E[g(X, Y)] < \infty$, then

\[ \int_{\mathbb{R}^n} \psi(y) f_Y(y)\,dy < \infty, \]

which implies that $\psi(y) < \infty$ for almost all $y \in \mathbb{R}^n$ such that $f_Y(y) > 0$. In particular $\psi(Y) < \infty$ almost surely (that is, $P(\psi(Y) < \infty) = 1$). Indeed, $P(\psi(Y) = \infty) = \int_{\{\psi(y) = \infty\}} f_Y(y)\,dy = 0$.

Let now $g : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}$ be a function of arbitrary sign such that $E[|g(X, Y)|] < \infty$, and in particular $E[g^\pm(X, Y)] < \infty$. Denote by $\psi^\pm$ the functions associated to $g^\pm$ as in (3.63). As we just saw, for almost all $y \in C$, $\psi^\pm(y) < \infty$, and therefore $\psi(y) = \psi^+(y) - \psi^-(y)$ is not an indeterminate form $\infty - \infty$. Thus the conditional expectation is well defined in the integrable case, and moreover $|E^Y[g(X, Y)]| < \infty$ almost surely.


Properties of the Conditional Expectation 


The properties will be given without proofs since they are easy adaptations of the 
proofs given in the discrete case, integrals replacing sums. 


The first property of conditional expectation, linearity, is obvious from the definitions: For all $\lambda_1, \lambda_2 \in \mathbb{R}$,

\[ E^Y[\lambda_1 g_1(X, Y) + \lambda_2 g_2(X, Y)] = \lambda_1 E^Y[g_1(X, Y)] + \lambda_2 E^Y[g_2(X, Y)] \]

whenever the conditional expectations thereof are well defined and do not produce $\infty - \infty$ forms. Monotonicity is equally obvious: if $g_1(x, y) \le g_2(x, y)$, then

\[ E^Y[g_1(X, Y)] \le E^Y[g_2(X, Y)]. \]


Theorem 3.5.4 If $g$ is non-negative or such that $E[|g(X, Y)|] < \infty$, we have

\[ E[E^Y[g(X, Y)]] = E[g(X, Y)]. \]

Proof. Same as in Theorem 2.4.5.

Theorem 3.5.5 If $w$ is non-negative or such that $E[|w(Y)|] < \infty$,

\[ E^Y[w(Y)] = w(Y), \qquad (3.65) \]

and more generally,

\[ E^Y[w(Y) h(X, Y)] = w(Y) E^Y[h(X, Y)], \qquad (3.66) \]

assuming that the left-hand side of (3.66) is well defined.

Proof. Same as in Theorem 2.4.6.

Theorem 3.5.6 If $X$ and $Y$ are independent and if $v$ is non-negative or such that $E[|v(X)|] < \infty$, then

\[ E^Y[v(X)] = E[v(X)]. \]


Proof. Same as in Theorem 2.4.7. 


Theorem 3.5.7 If $X$ and $Y$ are independent and if $g : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}$ is non-negative or such that $E[|g(X, Y)|] < \infty$, then, for all $y \in \mathbb{R}^n$,

\[ E[g(X, Y) \mid Y = y] = E[g(X, y)]. \]

Proof. Same as in Theorem 2.4.8.

We now give the successive conditioning rule. Suppose that $Y = (Y_1, Y_2)$, where $Y_1$ and $Y_2$ are random vectors. In this situation, we use the more developed notation

\[ E^Y[g(X, Y)] = E^{Y_1, Y_2}[g(X, Y_1, Y_2)]. \]

Theorem 3.5.8 Suppose that $Y = (Y_1, Y_2)$ as above. If $g$ is non-negative or such that $E[|g(X, Y)|] < \infty$, then

\[ E^{Y_2}[E^{Y_1, Y_2}[g(X, Y_1, Y_2)]] = E^{Y_2}[g(X, Y_1, Y_2)]. \qquad (3.67) \]


Proof. Same as in Theorem 2.4.9. 


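EXAMPLE 3.5.9: CORRELATED GAUSSIAN VARIABLES, TAKE 2. In the situation
of Example 3.5.2, we have


$$E^{X_2}[X_1] = \rho \frac{\sigma_1}{\sigma_2} X_2.$$

This follows from the remark at the end of Example 3.5.2, because


$$E^{X_2}[X_1] = \psi(X_2)$$


where
$$\psi(x_2) = \int_{\mathbb{R}} x_1 f_{X_1}^{X_2 = x_2}(x_1)\, dx_1 = \rho \frac{\sigma_1}{\sigma_2} x_2.$$

Similarly,
$$E^{X_2}[X_1^2] = \sigma_1^2 (1 - \rho^2) + \rho^2 \frac{\sigma_1^2}{\sigma_2^2} X_2^2.$$

Indeed


$$E^{X_2}[X_1^2] = \eta(X_2)$$

where
$$\eta(x_2) = \int_{\mathbb{R}} x_1^2 f_{X_1}^{X_2 = x_2}(x_1)\, dx_1.$$


This is the second moment of a Gaussian random variable of mean $\rho \frac{\sigma_1}{\sigma_2} x_2$ and
variance $\sigma_1^2 (1 - \rho^2)$, and therefore


$$\eta(x_2) = \sigma_1^2 (1 - \rho^2) + \left( \rho \frac{\sigma_1}{\sigma_2} x_2 \right)^2.$$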
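The two conditional moments of this example can be checked numerically. The following Python sketch is only an illustration, under the assumption (as in the formulas above) that $(X_1, X_2)$ is a centered Gaussian vector with standard deviations $\sigma_1, \sigma_2$ and correlation coefficient $\rho$. The function names and the crude "condition on a thin slab around $x_2$" estimator are choices made for this sketch, not part of the text.

```python
import math
import random


def sample_pair(rho, s1, s2, rng):
    """Draw (X1, X2): centered Gaussian, std devs s1 and s2, correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x2 = s2 * z2
    x1 = s1 * (rho * z2 + math.sqrt(1.0 - rho * rho) * z1)
    return x1, x2


def conditional_moments(x2, rho, s1, s2):
    """Closed forms of the example: E[X1 | X2 = x2] and E[X1^2 | X2 = x2]."""
    mean = rho * (s1 / s2) * x2
    second = s1 * s1 * (1.0 - rho * rho) + mean * mean
    return mean, second


def empirical_moments_near(x2_target, width, rho, s1, s2, n, seed=0):
    """Monte Carlo estimates using only the samples with |X2 - x2_target| < width."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x1, x2 = sample_pair(rho, s1, s2, rng)
        if abs(x2 - x2_target) < width:
            kept.append(x1)
    m1 = sum(kept) / len(kept)
    m2 = sum(x * x for x in kept) / len(kept)
    return m1, m2
```

For instance, `conditional_moments(1.0, 0.6, 2.0, 1.0)` and `empirical_moments_near(1.0, 0.1, 0.6, 2.0, 1.0, 100000)` should return nearby values, up to Monte Carlo error and the small bias due to the slab width.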


Bayesian Tests of Hypotheses 


Let $\Theta$ be a discrete random variable with values in $\{1, 2, \ldots, K\}$ and let $X$ be a
random vector with values in $\mathbb{R}^m$. The joint distribution of $\Theta$ and $X$ is specified
as follows:


$$P(\Theta = i) = \pi(i), \qquad P(X \in C \mid \Theta = i) = \int_C f_i(x)\, dx \qquad (1 \le i \le K),$$

where the $f_i$'s are probability densities on $\mathbb{R}^m$.


The interpretation in terms of tests of hypotheses is the following. The random 
variable © represents the state of Nature, and X — called the observation — is 
the (random) result of an experiment that depends on the actual state of Nature. 
If Nature happens to be in state i, then X admits a distribution with probability 
density f;. 


In view of the observation $X$, we wish to infer the actual value of $\Theta$. For
this, we design a guess strategy, that is, a function $g : \mathbb{R}^m \to \{1, 2, \ldots, K\}$ with the
interpretation that $\hat{\Theta} := g(X)$ is our guess (based only on the observation $X$) of the
(not directly observed) state $\Theta$ of Nature. An equivalent description of the strategy
$g$ is the partition $A = \{A_1, \ldots, A_K\}$ of $\mathbb{R}^m$ given by $A_i := \{x \in \mathbb{R}^m ;\ g(x) = i\}$.


The decision rule is then
$$X \in A_i \Rightarrow \hat{\Theta} = i.$$


The probability of error associated with this strategy is, by the Bayes rule of total
causes,


$$P_E(A) = P(\hat{\Theta} \ne \Theta) = \sum_{i=1}^{K} P(\hat{\Theta} \ne i \mid \Theta = i)\, \pi(i) = \sum_{i=1}^{K} P(X \notin A_i \mid \Theta = i)\, \pi(i).$$


Equivalently, the probability of correct decision is

$$1 - P_E(A) = \sum_{i=1}^{K} P(X \in A_i \mid \Theta = i)\, \pi(i) = \int_{\mathbb{R}^m} \left( \sum_{i=1}^{K} \pi(i) 1_{A_i}(x) f_i(x) \right) dx.$$


The following result is then obvious in view of the above expression for the 
probability of correct decision: 


Theorem 3.5.10 Any partition $A^*$ such that
$$x \in A_i^* \Rightarrow \pi(i) f_i(x) = \max_k \left( \pi(k) f_k(x) \right)$$


minimizes the probability of error. 
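The rule of Theorem 3.5.10 translates directly into a few lines of code. The sketch below is illustrative only: the names `map_decision` and `gaussian_pdf` are hypothetical, the densities are passed as ordinary Python functions, and ties are broken in favour of the smallest index (any tie-breaking rule gives an optimal partition).

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x (used here only as an example of f_i)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def map_decision(x, priors, densities):
    """Return the index i in {1, ..., K} maximizing pi(i) * f_i(x).

    priors[i-1] plays the role of pi(i) and densities[i-1] that of f_i.
    Ties are resolved by taking the smallest maximizing index.
    """
    scores = [p * f(x) for p, f in zip(priors, densities)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1
```

With two equiprobable hypotheses $N(0, 1)$ and $N(2, 1)$, the call `map_decision(0.9, [0.5, 0.5], [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 2.0, 1.0)])` selects hypothesis 1, since 0.9 lies below the midpoint of the two means.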


EXAMPLE 3.5.11: TWO GAUSSIAN HYPOTHESES, TAKE 1. In this example, the
hypotheses are Gaussian. More specifically, there are two equiprobable Gaussian
hypotheses: Nature chooses its state $\Theta$ equiprobably in $\{1, 2\}$, and the observation
$X$ is a Gaussian random variable, with $X \sim N(m_i, \sigma^2)$ under hypothesis $i$ ($i = 1, 2$). The two
hypotheses differ only by the mean of the observation. Since the hypotheses are
equiprobable, an optimal strategy is


$$f_1(X) > f_2(X) \Rightarrow \hat{\Theta} = 1, \qquad f_1(X) < f_2(X) \Rightarrow \hat{\Theta} = 2.$$


Since

$$f_i(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - m_i)^2}{2\sigma^2}},$$

the optimal rule is


$$(X - m_1)^2 < (X - m_2)^2 \Rightarrow \hat{\Theta} = 1.$$

Equivalently, supposing that $m_1 < m_2$,


$$X < \frac{m_1 + m_2}{2} \Rightarrow \hat{\Theta} = 1.$$


The probability of error can be expressed as


$$P_E(A) = \sum_{i=1}^{K} P(X \notin A_i \mid \Theta = i)\, \pi(i) = \sum_{i=1}^{K} \pi(i) P_{E,i}(A)$$


where $P_{E,i}(A)$ is the probability of making a wrong decision when Nature is in
state $i$.


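EXAMPLE 3.5.12: TWO GAUSSIAN HYPOTHESES, TAKE 2. This is the continuation
of Example 3.5.11. The probability of error $P_E$ is given by


$$P_E = Q\left( \frac{m_2 - m_1}{2\sigma} \right),$$


where the function $Q$ is the tail of the standard normal distribution:

$$Q(x) := \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{y^2}{2}}\, dy.$$


Proof. We evaluate $P_{E,1}$, the probability of error when $X \sim N(m_1, \sigma^2)$, supposing
that $m_1 < m_2$:

$$P_{E,1} = \frac{1}{\sigma \sqrt{2\pi}} \int_{\frac{m_1 + m_2}{2}}^{\infty} e^{-\frac{(x - m_1)^2}{2\sigma^2}}\, dx.$$


By symmetry, $P_{E,1} = P_{E,2} = P_E$. Therefore, with $X_1 \sim N(m_1, \sigma^2)$, and observing
that $X_1$ then has the same distribution as the variable $\sigma Z + m_1$, where $Z \sim N(0, 1)$,


$$P_E = P\left( X_1 > \frac{m_1 + m_2}{2} \right) = P\left( \sigma Z + m_1 > \frac{m_1 + m_2}{2} \right) = P\left( Z > \frac{m_2 - m_1}{2\sigma} \right) = Q\left( \frac{m_2 - m_1}{2\sigma} \right).$$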
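For numerical purposes, the tail $Q$ can be expressed through the complementary error function, $Q(x) = \frac{1}{2} \operatorname{erfc}(x / \sqrt{2})$. The short Python sketch below uses this identity; the function names are hypothetical, and the absolute value in `error_probability` is only a convenience so that the order of the two means does not matter.

```python
import math


def q_function(x):
    """Tail of the standard normal distribution: Q(x) = P(Z > x), Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def error_probability(m1, m2, sigma):
    """Error probability Q(|m2 - m1| / (2 sigma)) of the optimal test
    between two equiprobable hypotheses N(m1, sigma^2) and N(m2, sigma^2)."""
    return q_function(abs(m2 - m1) / (2.0 * sigma))
```

As expected, the error probability equals 1/2 when the two means coincide and decreases to 0 as $(m_2 - m_1)/\sigma$ grows.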


Let now the observation be a discrete random variable taking its values in some
finite set $E$, and suppose that


$$f_i(x) = P(X = x \mid \Theta = i) \qquad (i = 1, 2).$$


The result of the continuous observations case applies mutatis mutandis. Any
partition $A^*$ of $E$ such that


$$x \in A_i^* \Rightarrow \pi(i) f_i(x) = \max_k \left( \pi(k) f_k(x) \right)$$


minimizes the probability of error. 
EXAMPLE 3.5.13: THE BINARY CHANNEL WITH FLIP NOISE. In this example
$E = \{0, 1\}^n$. The addition $\oplus$ defined on $E$ being the componentwise addition
modulo 2, the observation is $X = m_\Theta \oplus Z$ where


$$m_i = (m_i(1), \ldots, m_i(n)) \in \{0, 1\}^n, \qquad Z = (Z_1, \ldots, Z_n),$$


where $Z$ and $\Theta$ are independent, and the $Z_i$'s ($1 \le i \le n$) are independent and identically
distributed with $P(Z_i = 1) = p$. A possible interpretation is in terms
of digital communications, when one wishes to transmit the information $\Theta$ chosen
among a finite set of "messages" which are binary strings of length $n$: $m_1, \ldots, m_K$.
The vector $Z$ is the "noise" inherent to all digital communications channels: if
$Z_k = 1$ the $k$-th bit of the message $m_\Theta$ is flipped. In the simplest model, this error
occurs with probability $p$, independently for all the bits of the message, and the
hypotheses are equiprobable. One may suppose without loss of generality that
$p < \frac{1}{2}$. We have:


$$P(X = x \mid \Theta = i) = P(Z \oplus m_i = x) = P(Z = m_i \oplus x).$$


Denoting by $h(y)$ the Hamming weight of $y \in \{0, 1\}^n$ (equal to the number of
components of $y$ that are equal to 1), and by


$$d(x, y) := \sum_{i=1}^{n} 1_{\{x_i \ne y_i\}} = \sum_{i=1}^{n} x_i \oplus y_i = h(x \oplus y)$$


the Hamming distance between $x$ and $y$ in $E$, we have


$$P(Z = y) = p^{h(y)} (1 - p)^{n - h(y)} = (1 - p)^n \left( \frac{p}{1 - p} \right)^{h(y)}.$$


Therefore


$$P(X = x \mid \Theta = i) = (1 - p)^n \left( \frac{p}{1 - p} \right)^{d(x, m_i)}.$$


Since $p < \frac{1}{2}$, we have $\frac{p}{1 - p} < 1$, and this likelihood is a decreasing function of $d(x, m_i)$.
Therefore the optimal strategy consists in choosing the hypothesis corresponding
to the message closest to the observation in terms of the Hamming distance.
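The minimum-distance rule of this example can be sketched as follows. This is an illustration only: messages and observations are represented as tuples of 0/1 integers, the function names are hypothetical, and ties between equidistant messages are resolved in favour of the smallest index.

```python
def hamming_distance(x, y):
    """Number of positions at which the binary words x and y differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def likelihood(x, message, p):
    """P(X = x | message sent) = p^d (1 - p)^(n - d), d = Hamming distance."""
    d = hamming_distance(x, message)
    return p ** d * (1.0 - p) ** (len(x) - d)


def decode(x, messages):
    """Index (1-based) of the message closest to x in Hamming distance.

    For equiprobable messages and flip probability p < 1/2 this is the
    decision maximizing the likelihood above.
    """
    best = min(range(len(messages)), key=lambda i: hamming_distance(x, messages[i]))
    return best + 1
```

With the two repetition codewords `(0, 0, 0, 0, 0)` and `(1, 1, 1, 1, 1)`, the observation `(1, 0, 0, 1, 0)` is decoded as message 1 (distance 2 against distance 3), which is the familiar majority vote.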
3.6 Exercises


Exercise 3.6.1. SUM OF IID UNIFORM VARIABLES

A point inside the unit square $[0, 1]^2 = [0, 1] \times [0, 1]$ is chosen at random according to
the following model: $\Omega = [0, 1]^2$, $P(A)$ = area of $A$. In other words, $\omega = (x, y) \in \Omega$
is a point uniformly distributed on the unit square. Let $X(\omega) := x$ and $Y(\omega) := y$.


1. Compute the probability density function of $Z = X + Y$.
2. Compute $E[Z^2]$.


Exercise 3.6.2. UNIFORM DISTRIBUTION ON A DISK 

Consider the following probability model: $\Omega = \{(x, y) \in \mathbb{R}^2 ;\ x^2 + y^2 \le 1\}$, $P(A) =
\frac{1}{\pi} \times$ (area of $A$). In other words, $\omega = (x, y) \in \Omega$ is a point uniformly distributed
on the unit disk. Letting $X(\omega) := x$ and $Y(\omega) := y$, show that $X$ and $Y$ are not
independent random variables.


Exercise 3.6.3. SQUARE ROOT OF A RANDOM VARIABLE 

Let X be a non-negative real random variable with probability density function 
fx. What is the probability density function of Z, the non-negative square root 
of X? 


Exercise 3.6.4. QUANTIZATION NOISE 

In the digital world, measurements are not recorded in continuous form, but in
quantized form. For instance, a random variable $X$ taking its values in the range
$[0, A]$ will be recorded as $Y = i\Delta$ if $X \in [i\Delta, (i+1)\Delta)$, where $\Delta = A / 2^n$. Therefore
there are $2^n$ possible values for $Y$, and it is then said that $X$ has been quantized on
$n$ bits. In the applied literature, the error $X - Y \in [0, \Delta)$ is often assumed to be
uniformly distributed on this interval. Compute its variance under this (generally
wrong) assumption.


Exercise 3.6.5. CAUCHY DISTRIBUTION 
(a) Show that the characteristic function of a Cauchy random variable (that is,
with the probability density function $f(x) := \frac{1}{\pi (1 + x^2)}$) is $\psi_X(u) = e^{-|u|}$.


(b) Let $\{X_n\}_{n \ge 1}$ be a sequence of independent Cauchy random variables. Let $T$
be a positive integer-valued random variable, independent of this sequence. Define
$Y = \sum_{n=1}^{T} X_n$. What is the probability distribution of $Z = \frac{Y}{T}$?


Exercise 3.6.6. CONTINUOUS × DISCRETE
1) Let $X$ be a real-valued random variable with the probability density function
$f_X$. Let $Y$ be a positive integer-valued random variable ($Y \in \{1, 2, \ldots\}$) with the
distribution $P(Y = k) = p_k$, $k \ge 1$. Suppose that $X$ and $Y$ are independent.
Show that the random variable $Z = XY$ is absolutely continuous and give its
probability density function $f_Z$.

2) Consider the same setting as in 1) except that $Y$ may take the value 0, with
positive probability $p_0$. What is the cumulative distribution function of $Z$?


Exercise 3.6.7. CONTINUOUS + DISCRETE 

Let $X$ be a real-valued random variable with probability density function $f_X(x)$
and let $Y$ be an integer-valued random variable with distribution $P(Y = k) = p_k$,
$k \ge 0$. Suppose that $X$ and $Y$ are independent. Show that the sum $Z = X + Y$ is an
absolutely continuous random variable, and give its probability density function.


Exercise 3.6.8. HAZARD RATE, I 
The hazard rate function $\lambda : \mathbb{N} \to [0, 1]$ of an integer-valued random variable $X$ is defined
by $\lambda(n) = P(X = n \mid X \ge n)$.


(i) Compute $P(X \ge n)$ and $P(X = n)$ in terms of $\lambda(0), \ldots, \lambda(n)$.


(ii) Let $\{U_n\}_{n \ge 0}$ be a sequence of IID random variables uniformly distributed on
$[0, 1]$. Show that the random variable $Z := \min\{n \ge 0 : U_n \le \lambda(n)\}$ has the same
distribution as $X$.


Exercise 3.6.9. HAZARD RATE, II 
Let $F$ be the CDF of a non-negative random variable with a probability density
function $f$. Let $I := [0, t_0)$ ($t_0$ possibly infinite) be the set of $t \in \mathbb{R}_+$ such that
$F(t) < 1$. Define, for $t \in I$, the hazard rate

$$\lambda(t) = \frac{f(t)}{1 - F(t)}.$$


1. Show that for $t \in I$,
$$f(t) = \lambda(t)\, e^{-\int_0^t \lambda(s)\, ds}.$$


2. Compute the hazard rate of the exponential variable.


Exercise 3.6.10. MORE HAZARD RATES 
1. Let $T_1$ and $T_2$ be two non-negative random variables admitting a probability
density function and with respective hazard rates (see Exercise 3.6.9) $\lambda_1(t)$ and
$\lambda_2(t)$. What is the hazard rate of $T = \min(T_1, T_2)$?


2. Show that the property $P(T_2 > t) = P(T_1 > t)^\alpha$ (for some $\alpha > 0$) is equivalent
to: $\lambda_2(t) = \alpha \lambda_1(t)$.


Exercise 3.6.11. cos(Φ)
Let $\Phi$ be a random variable uniformly distributed on the interval $[0, 2\pi]$ and define
$X = \cos(\Phi)$. Compute the mean and the variance of $X$.


Exercise 3.6.12. MAXIMUM OF IID VARIABLES 

Let $X_1, X_2, \ldots, X_n$ be independent random variables uniformly distributed on
$[0, 1]$, that is to say, with the probability density $f(x) = 1_{[0,1]}(x)$. Compute the
expectation of $Z = \max(X_1, \ldots, X_n)$.


Exercise 3.6.13. GAUSSIAN MEAN AND VARIANCE 
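Let $m \in \mathbb{R}$, $\sigma > 0$.


i) Prove that $\int_{\mathbb{R}} e^{-\frac{1}{2} x^2}\, dx = \sqrt{2\pi}$, and deduce that
$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2}$ is a probability
density function on $\mathbb{R}$.


ii) Prove that

$$\frac{1}{\sigma \sqrt{2\pi}} \int_{\mathbb{R}} x\, e^{-\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2}\, dx = m.$$


iii) Prove that

$$\frac{1}{\sigma \sqrt{2\pi}} \int_{\mathbb{R}} (x - m)^2\, e^{-\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2}\, dx = \sigma^2.$$


Exercise 3.6.14. SQUARE OF A GAUSSIAN VARIABLE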

Let X be a real random variable with the probability density function fx(x) = 


aa e 7 Compute the probability density function fy of Y = X?. 

Exercise 3.6.15. TWO RANDOM NUMBERS IN (0, 1] 

Two numbers are drawn independently and completely at random on [0,1]. The 
smaller is larger than z. Given this information, what is the probability that the 
larger number exceeds 3 


Exercise 3.6.16. COUNTEREXAMPLE 

Give a simple example showing that the cumulative distribution functions of the coordinates
of a random vector do not completely describe the probabilistic behavior of the
whole vector.


Exercise 3.6.17. POLAR COORDINATES 

Let $(X, Y)$ be a random vector uniformly distributed on $D \setminus \{0\}$, the closed unit
disk of $\mathbb{R}^2$ centered at the origin without the origin. Let $(Z, \Theta)$ be its polar
coordinates ($Z \in (0, 1]$, $\Theta \in (0, 2\pi]$). Show that $Z$ and $\Theta$ are independent.


Exercise 3.6.18. QUOTIENT OF UNIFORM RANDOM VARIABLES 


Let $X_1$ and $X_2$ be two independent random variables uniformly distributed over
$[0, 1]$. Find the probability density function of $X_1 / X_2$.


Exercise 3.6.19. PRODUCT OF UNIFORM VARIABLES


Let U and V be two independent random variables uniformly distributed on [0, 1]. 
Show that the variable Z = UV has a probability density and compute it. 


Exercise 3.6.20. RANDOM ROOTS 

The numbers A and B are selected independently and uniformly on the segment 
$[-1, +1]$. Find the probability that the roots of the equation $x^2 + 2Ax + B = 0$ are
real.


Exercise 3.6.21. ISN’T THIS PUZZLING? 

Some guy uses his wild imagination to preselect two different numbers, and he
does not tell you which ones. Then he chooses one of the two preselected numbers
at random (probability $\frac{1}{2}$, $\frac{1}{2}$). He shows this number to you and asks you to guess
whether it is the larger of the two numbers he preselected. Are you interested in playing
(meaning: do you think that you have a better guess than a random guess (yes-no
with probability $\frac{1}{2}$, $\frac{1}{2}$))? Hint: you might fix for yourself a "reference number", and
compare it with the number shown.


Exercise 3.6.22. INFIMUM OF INDEPENDENT EXPONENTIALS 

Let $X_1, \ldots, X_n$ be independent exponential random variables with the respective
parameters $\lambda_i$, $i \in [1, n]$. Define $Z = \inf(X_1, \ldots, X_n)$ and let $J$ be the (random)
index such that $X_J = Z$ ($J$ is for almost all $\omega \in \Omega$ unambiguously defined because,
$P$-almost surely, $X_1, \ldots, X_n$ take different values). Show that $Z$ and $J$ are
independent, and give their respective distributions.


Exercise 3.6.23. RANDOM SUM OF EXPONENTIAL VARIABLES 

Let $\{X_n\}_{n \ge 1}$ be a sequence of IID exponential random variables with common
mean $\lambda^{-1} > 0$, and let $T$ be a geometric random variable with mean $p^{-1} > 0$,
and independent of the above sequence. Show that $Z := X_1 + \cdots + X_T$ admits a
probability density function. Which one?


Exercise 3.6.24. SUM OF IID EXPONENTIALS 


Let $\{X_n\}_{n \ge 1}$ be an IID sequence of exponential random variables with mean $1/\theta$,
where $\theta \in (0, \infty)$. What is the distribution of $Z = X_1 + \cdots + X_n$?


Exercise 3.6.25. CHARACTERISTIC FUNCTION OF Y = AX +b 

Let $\psi_X(u)$ be the characteristic function of the random vector $X$. What is the
characteristic function of the random vector $Y = AX + b$, where $A$ is a matrix and
$b$ is a vector of appropriate dimensions?


Exercise 3.6.26. CHARACTERISTIC FUNCTION OF THE MULTINOMIAL RANDOM
VECTOR

Let $(X_1, \ldots, X_k)$ be a multinomial random vector of size $k$ and parameters $p_1, \ldots, p_k$
($p_i > 0$, $p_1 + \cdots + p_k = 1$). Compute the characteristic function of $(X_1, \ldots, X_{k-1})$.


Exercise 3.6.27. PRODUCT OF UNIFORM VARIABLES 
Let $U_1, \ldots, U_n$ be independent uniform random variables on $[0, 1]$. Give the CDF of
the random variable $U_1 \times U_2 \times \cdots \times U_n$. (Hint: logarithms and Exercise 3.6.24.)


Exercise 3.6.28. QUOTIENT OF EXPONENTIAL RANDOM VARIABLES 
Let $X_1$ and $X_2$ be two independent random variables with a common exponential
distribution of mean $\theta^{-1}$. Give the probability density function of the variable


parece 


Exercise 3.6.29. CORRELATION COEFFICIENT 

Let $X$ and $Y$ be square-integrable random variables. Let $a, b, c, d$ be real numbers,
$a \ne 0$, $c \ne 0$. Give the correlation coefficient of $aX + b$ and $cY + d$ in terms of the
correlation coefficient $\rho_{XY}$ of $X$ and $Y$.


Exercise 3.6.30. SIGNAL PLUS NOISE ON TWO CHANNELS 

Let $Y, Z_1, Z_2$ be square-integrable centered real random variables, and suppose that
$Y$ is independent of $Z_1$ and $Z_2$. A useful interpretation is that $Y$ represents an
informative "signal" that is observed via two channels, one producing the observation
$Y + Z_1$ and the other the observation $Y + Z_2$, where $Z_1$ and $Z_2$ are considered
as "noises". The following questions are then natural.


What is the best linear-quadratic estimate of $Y$ in terms of $(Y + Z_1, Y + Z_2)$? Give
the minimum quadratic error. Write this error in the following particular cases:
(a): $Z_1$ and $Z_2$ are uncorrelated, and (b): $Z_1$ and $Z_2$ have the same variance.


Exercise 3.6.31. COVARIANCE MATRIX OF THE MULTINOMIAL VECTOR 
Compute the covariance matrix of a multinomial random vector of size $k$ and with
parameters $p_1, \ldots, p_k$.


Exercise 3.6.32. AUTOREGRESSIVE GAUSSIAN MODEL, TAKE 1 
Consider the stochastic sequence $\{X_n\}_{n \ge 0}$ defined by


$$X_{n+1} = a X_n + \varepsilon_{n+1} \qquad (n \ge 0),$$


where $X_0$ is a Gaussian random variable of mean 0 and variance $c^2$, and $\{\varepsilon_n\}_{n \ge 1}$ is
a sequence of IID Gaussian variables of mean 0 and variance $\sigma^2$, and independent
of $X_0$.

1. Show that for all $n \ge 1$, the vector $(X_0, \ldots, X_n)$ is a Gaussian vector.
2. Express $X_n$ in terms of $X_0, \varepsilon_1, \ldots, \varepsilon_n$ (and $a$). Give the mean and variance of
$X_n$.


Exercise 3.6.33. PROBABILITY OF THE QUADRANT 
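Let $(X, Y)$ be a 2-dimensional Gaussian vector with probability density


$$f(x, y) = \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left( x^2 - 2\rho x y + y^2 \right) \right\},$$


where $|\rho| < 1$. Show that $X$ and $(Y - \rho X) / (1 - \rho^2)^{1/2}$ are independent Gaussian
random variables with mean 0 and variance 1. Deduce from this that


$$P(X > 0, Y > 0) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho).$$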


Exercise 3.6.34. a AND = 
Let X and Y be two independent Gaussian random variables with mean 0 and vari- 
ance 1. Show that the random variables a and as are independent Gaussian 


random variables, and give their means and variances. 


Exercise 3.6.35. QUOTIENT OF $\chi^2$ DISTRIBUTIONS
Let $X$ and $Y$ be two independent random variables such that


$$X \sim \chi^2_n \quad \text{and} \quad Y \sim \chi^2_m.$$


Compute the probability density function of $(Z, Y)$ where $Z := \frac{X}{Y}$ and deduce
from the result the probability density function of $Z$.


Exercise 3.6.36. UNBIASEDNESS OF THE EMPIRICAL VARIANCE 
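Let $\{X_n\}_{n \ge 1}$ be an IID sequence of square-integrable random variables with mean
$\theta$ and variance $\sigma^2$. Show that the variance estimate


$$\hat{\sigma}_N^2 := \frac{1}{N - 1} \sum_{i=1}^{N} (X_i - \hat{\theta}_N)^2,$$


where $\hat{\theta}_N := \frac{1}{N} \sum_{i=1}^{N} X_i$, is unbiased, that is, $E[\hat{\sigma}_N^2] = \sigma^2$.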


Exercise 3.6.37. $X_1 - X_2$

Let $X_1$ and $X_2$ be two independent random variables admitting the probability
density functions $f_1$ and $f_2$ respectively. What is the probability density function
of $X_1 - X_2$?

Exercise 3.6.38. CUMULATIVE DISTRIBUTION FUNCTIONS

Let $X$ be a real-valued random variable with CDF $F$, and let $g$ be a strictly
increasing function. Find the CDFs of the random variables $X^2$, $\sqrt{X}$ ($X$ assumed
non-negative), $F(X)$, $g^{-1}(X)$ and $g^{-1}(F(X))$.


Exercise 3.6.39. SUM OF IID EXPONENTIALS 

Let $X_1, \ldots, X_n$ be IID exponential random variables with mean $\lambda^{-1}$. Give the
characteristic function of $X_1 + \cdots + X_n$, and deduce from the result its probability
density function.


Exercise 3.6.40. $(S, T - S)$

Let $\{X_n\}_{n \ge 1}$ be independent random variables taking the values 0 and 1 with
probability $q = 1 - p$ and $p$, respectively, where $p \in (0, 1)$. Let $T$ be a Poisson
random variable with mean $\theta > 0$, independent of $\{X_n\}_{n \ge 1}$. Define


$$S = X_1 + \cdots + X_T.$$


Compute the characteristic function of the vector $(S, T - S)$. Deduce from this
that $S$ and $T - S$ are independent Poisson random variables with respective means
$p\theta$ and $q\theta$.


Exercise 3.6.41. PROBABILITY DENSITY FUNCTION OF $X_1 - X_2$

Let $X_1$ and $X_2$ be two independent random variables admitting the probability
density functions $f_1$ and $f_2$ respectively. What is the probability density function
of $X_1 - X_2$?


Exercise 3.6.42. THE FIRST BOX 


Let $X = (X_1, \ldots, X_K)$ be a multinomial vector of size $(n, K)$ and parameters
$p_1, \ldots, p_K$. Show that $X_1$ is a binomial random variable of size $n$ and parameter
$p_1$.


Exercise 3.6.43. SUM OF MULTINOMIALS 

Let $X = (X_1, \ldots, X_K)$ and $Y = (Y_1, \ldots, Y_K)$ be two independent multinomial random
vectors of sizes $(n, K)$ and $(m, K)$, respectively, and with the same parameters
$p_1, \ldots, p_K$. What is the distribution of $Z = X + Y$?


Exercise 3.6.44. POISSON COVARIANCE MATRIX 
Let Z1, Z2,..., Z, be independent Poisson random variables with respective means 
0, 02, oman » Opes Let 


Give the covariance matrix of X = (Xj,..., Bas 
Exercise 3.6.45. UNCORRELATED, YET DEPENDENT
Let X and Y be IID random variables with the equiprobable values 0 or 1. Show 
that X + Y and |X — Y| are uncorrelated, yet dependent. 


Exercise 3.6.46. USELESS INFORMATION 
In the theory of Bayesian tests of hypotheses of Section 3.5 (page 129), suppose 
that the observation is of the form $X = (Y, Z) \in \mathbb{R}^m = \mathbb{R}^{n+p}$ where $Y \in \mathbb{R}^n$
and $Z \in \mathbb{R}^p$, and that under each hypothesis $\Theta = i$, the probability density of $X$
admits the factorization

$$f_i(y, z) = g_i(y) h(z),$$
where $g_i$ and $h$ are probability densities on $\mathbb{R}^n$ and $\mathbb{R}^p$ respectively. Show that the
optimal test based on X does not use the information on Z, and is the same as 
the optimal test based on Y alone. 


Exercise 3.6.47. TWO GAUSSIAN HYPOTHESES, TAKE 3

We consider a Bayesian test of hypotheses with two equiprobable hypotheses. The
observation $X \in \mathbb{R}^m$ is a random vector. For $i = 1, 2$, $X \sim N(m_i, \Gamma)$ where $\Gamma$ is
an invertible covariance matrix (the two hypotheses differ only by the mean of $X$).
Describe the optimal Bayesian test of hypotheses in this situation. Give details
for the case where the coordinates of the observation vector $X$ are independent
and identically distributed. In the latter case, compute the probability of error. 
Compare with Examples 3.5.11 and 3.5.12. 


Exercise 3.6.48. BAYESIAN TEST AND VARIATION DISTANCE 
(a) Let $X$ and $Y$ be two absolutely continuous random variables with the respective
probability densities $f$ and $g$. Define their distance in variation by


$$d_V(X, Y) := \sup_{A \in \mathcal{B}(\mathbb{R})} |P(X \in A) - P(Y \in A)|.$$

Show that


$$d_V(X, Y) = \frac{1}{2} \int_{\mathbb{R}} |f(x) - g(x)|\, dx.$$

(b) In the Bayesian test with an observation $X \in \mathbb{R}^m$ and two equiprobable hypotheses
for the probability density function of the observation: $f_1$ and $f_2$, compute
the probability of error of the optimal test.



Chapter 4 


The Lebesgue Integral 


The previous chapters concerned what one may call the basic “calculus of prob- 
ability”, that is, the acquisition of the skills that suffice to deal with elementary 
stochastic models involving discrete random variables and absolutely continuous 
random vectors. This chapter will considerably increase the expertise of the reader 
at the expense of a reasonable amount of abstraction. It contains a short sum- 
mary of the abstract Lebesgue integral that will then be interpreted in probabilistic 
terms in the next chapter. 


4.1 Measurable Functions and Measures 


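σ-fields


Denote by $\mathcal{P}(X)$ the collection of all subsets of an arbitrary set $X$. Recall the
definition of a σ-field:


Definition 4.1.1 A family $\mathcal{X} \subseteq \mathcal{P}(X)$ of subsets of $X$ is called a σ-field on $X$ if:
(α) $X \in \mathcal{X}$;
(β) $A \in \mathcal{X} \Rightarrow \overline{A} \in \mathcal{X}$;
(γ) $A_n \in \mathcal{X}$ for all $n \in \mathbb{N}$ $\Rightarrow \bigcup_{n=0}^{\infty} A_n \in \mathcal{X}$.


One then says that $(X, \mathcal{X})$ is a measurable space.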


Also recall: 


Definition 4.1.2 The σ-field generated by a non-empty collection of subsets $\mathcal{C} \subseteq
\mathcal{P}(X)$ is, by definition, the smallest σ-field on $X$ containing all the sets in $\mathcal{C}$. It is
denoted by $\sigma(\mathcal{C})$.
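When $X$ is a finite set, the generated σ-field can be computed explicitly, which may help to make the definition concrete. The following Python sketch is an illustration restricted to that finite case (where countable unions reduce to finite ones); the function name is hypothetical. It starts from the collection, adds the empty set and $X$, and repeatedly closes under complementation and pairwise union until nothing new appears.

```python
from itertools import combinations


def generated_sigma_field(universe, collection):
    """Smallest family of subsets of a FINITE universe that contains every set
    of `collection` and is closed under complement and (finite) union."""
    universe = frozenset(universe)
    field = {frozenset(), universe} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(field)
        # Close under complementation.
        for a in current:
            complement = universe - a
            if complement not in field:
                field.add(complement)
                changed = True
        # Close under pairwise union (enough for finite unions by iteration).
        for a, b in combinations(current, 2):
            union = a | b
            if union not in field:
                field.add(union)
                changed = True
    return field
```

For example, on $X = \{1, 2, 3, 4\}$ the collection $\{\{1\}\}$ generates the four sets $\emptyset$, $\{1\}$, $\{2, 3, 4\}$, $X$, while $\{\{1\}, \{2\}\}$ generates the eight unions of the atoms $\{1\}$, $\{2\}$, $\{3, 4\}$.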
The Borel σ-field on $\mathbb{R}^n$, $\mathcal{B}(\mathbb{R}^n)$, already briefly introduced in the first chapter,
receives a convenient definition in terms of the Euclidean topology.


First recall that a set $O \subseteq \mathbb{R}^n$ is called open if for any $x \in O$, one can find a
non-empty open ball centered on $x$ and contained in $O$.


Definition 4.1.3 The Borel σ-field $\mathcal{B}(\mathbb{R}^n)$ on $\mathbb{R}^n$ is, by definition, the σ-field
generated by the open sets of $\mathbb{R}^n$.


The next result gives a more convenient way of defining the Borel σ-field, of
the type given in the first chapter.


Theorem 4.1.4 The σ-field $\mathcal{B}(\mathbb{R}^n)$ is also generated by the collection $\mathcal{C}$ of all
rectangles of the type $\prod_{i=1}^{n} (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ (the rationals) for all
$i \in \{1, \ldots, n\}$.


Proof. Exercise 4.5.4. 


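Definition 4.1.5 $\mathcal{B}(\overline{\mathbb{R}})$ is, by definition, the σ-field on $\overline{\mathbb{R}} := \mathbb{R} \cup \{+\infty, -\infty\}$
generated by the intervals of type $[-\infty, a]$ ($a \in \mathbb{R}$).


It can be readily checked that it consists of the collection of sets of the form
$A$, $A \cup \{+\infty\}$, $A \cup \{-\infty\}$, $A \cup \{+\infty, -\infty\}$ ($A \in \mathcal{B}(\mathbb{R})$).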


Measurable Functions 


This is the first fundamental notion of Lebesgue’s integration theory. 


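Definition 4.1.6 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces. A function
$f : X \to E$ is called a measurable function with respect to $\mathcal{X}$ and $\mathcal{E}$ if


$$f^{-1}(C) := \{x \in X ;\ f(x) \in C\} \in \mathcal{X} \quad \text{for all } C \in \mathcal{E}.$$

This is denoted by
$$f : (X, \mathcal{X}) \to (E, \mathcal{E}) \quad \text{or} \quad f \in \mathcal{E}/\mathcal{X}.$$

A function $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$, where $(X, \mathcal{X})$ is an arbitrary measurable
space, is called an extended measurable function. Functions $f : (X, \mathcal{X}) \to
(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ are called real measurable functions.

Definition 4.1.7 A measurable function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ of the type


$$f(x) = \sum_{i=1}^{k} a_i 1_{A_i}(x), \qquad (4.1)$$


where $k \in \mathbb{N}_+$, $a_1, \ldots, a_k \in \mathbb{R}$, $A_1, \ldots, A_k \in \mathcal{X}$, is called a simple measurable
function (defined on $X$).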


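It seems difficult to prove measurability since most σ-fields are not defined
explicitly (see the definition of $\mathcal{B}(\mathbb{R}^n)$ for instance). However, the following result
renders the task feasible.


Theorem 4.1.8 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, where $\mathcal{E} = \sigma(\mathcal{C})$
for some collection $\mathcal{C}$ of subsets of $E$. Then $f : (X, \mathcal{X}) \to (E, \mathcal{E})$ if and only
if $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{C}$.


Proof. We shall first make two obvious preliminary observations. Let $X$ and $E$
be arbitrary sets, $f : X \to E$ an arbitrary function from $X$ to $E$, $\mathcal{G}$ an arbitrary
σ-field on $E$, and let $\mathcal{C}, \mathcal{C}_1, \mathcal{C}_2$ be arbitrary non-empty collections of subsets of $E$.
Then


(i) $\sigma(\mathcal{G}) = \mathcal{G}$,
(ii) $\mathcal{C}_1 \subseteq \mathcal{C}_2 \Rightarrow \sigma(\mathcal{C}_1) \subseteq \sigma(\mathcal{C}_2)$.


Now, the collection $\mathcal{G} := \{C \subseteq E ;\ f^{-1}(C) \in \mathcal{X}\}$ is a σ-field and, by hypothesis,
$\mathcal{C} \subseteq \mathcal{G}$. Therefore, by (ii) and (i), $\mathcal{E} = \sigma(\mathcal{C}) \subseteq \sigma(\mathcal{G}) = \mathcal{G}$.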


An immediate application of this result and Theorem 4.1.4 is: 


Corollary 4.1.9 Let $(X, \mathcal{X})$ be a measurable space and let $n \ge 1$ be an integer.
Then $f = (f_1, \ldots, f_n) : (X, \mathcal{X}) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ if and only if for all $i$ ($1 \le i \le n$),
$\{f_i \le a_i\} \in \mathcal{X}$ for all $a_i \in \mathbb{Q}$.

A function $f : \mathbb{R}^k \to \mathbb{R}^m$ is said to be continuous if the inverse image of an
open set is open,¹ that is, for all open sets $O_m \subseteq \mathbb{R}^m$, the set $\{x \in \mathbb{R}^k ;\ f(x) \in O_m\}$
is an open set of $\mathbb{R}^k$. The following result is then a direct application of Theorem
4.1.8 in view of Definition 4.1.3 of $\mathcal{B}(\mathbb{R}^m)$.


¹ This definition is equivalent to the usual ε-δ definition of continuity, which we shall admit
here.

Corollary 4.1.10 Any continuous function $f : \mathbb{R}^k \to \mathbb{R}^m$ is measurable with
respect to $\mathcal{B}(\mathbb{R}^k)$ and $\mathcal{B}(\mathbb{R}^m)$.


Another nice feature of the notion of measurability is its stability under com- 
position. 


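Theorem 4.1.11 Let $(X, \mathcal{X})$, $(Y, \mathcal{Y})$ and $(E, \mathcal{E})$ be three measurable spaces, and
let $\varphi : (X, \mathcal{X}) \to (Y, \mathcal{Y})$, $g : (Y, \mathcal{Y}) \to (E, \mathcal{E})$. Then $g \circ \varphi : (X, \mathcal{X}) \to (E, \mathcal{E})$.


Proof. Let $f = g \circ \varphi$ (meaning: $f(x) = g(\varphi(x))$ for all $x \in X$). For all $C \in \mathcal{E}$,
$$f^{-1}(C) = \varphi^{-1}(g^{-1}(C)) = \varphi^{-1}(D) \in \mathcal{X},$$


because $D = g^{-1}(C)$ is a set in $\mathcal{Y}$ since $g \in \mathcal{E}/\mathcal{Y}$, and therefore $\varphi^{-1}(D) \in \mathcal{X}$ since
$\varphi \in \mathcal{Y}/\mathcal{X}$.


Corollary 4.1.12 Let $\varphi = (\varphi_1, \ldots, \varphi_n)$ be a measurable function from $(X, \mathcal{X})$ to
$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, and let $g : \mathbb{R}^n \to \mathbb{R}$ be a continuous function. Then
$g \circ \varphi : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.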


Proof. Follows directly from Theorem 4.1.11 and Corollary 4.1.10. 


This corollary in turn allows us to show that the elementary operations (addi- 
tion, multiplication and quotient) preserve measurability. 


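Corollary 4.1.13 Let $\varphi_1, \varphi_2 : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, and let $\lambda \in \mathbb{R}$. Then
$\varphi_1 \times \varphi_2$, $\varphi_1 + \varphi_2$, $\lambda \varphi_1$, $(\varphi_1 / \varphi_2) 1_{\{\varphi_2 \ne 0\}}$ are real measurable functions. Moreover, the set
$\{\varphi_1 = \varphi_2\}$ is a measurable set.


Proof. For the first three functions, take in the previous corollary $g(x_1, x_2) =
x_1 \times x_2$, $= x_1 + x_2$, $= \lambda x_1$, successively.


For $(\varphi_1 / \varphi_2) 1_{\{\varphi_2 \ne 0\}}$, define $\psi_2 = \frac{1_{\{\varphi_2 \ne 0\}}}{\varphi_2}$, check that the latter function is measurable,
and use the just proven fact that the product $\varphi_1 \psi_2$ is then measurable.


Finally, $\{\varphi_1 = \varphi_2\} = \{\varphi_1 - \varphi_2 = 0\} = (\varphi_1 - \varphi_2)^{-1}(\{0\})$ is a measurable set
since $\varphi_1 - \varphi_2$ is a measurable function and $\{0\}$ is a measurable set.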


Finally, and most importantly, taking limits preserves measurability. By con- 
trast, it is far from being true that limits of continuous functions are continuous 
functions. 
Theorem 4.1.14 Let $f_n : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$, $n \in \mathbb{N}$. Then $\liminf_{n \uparrow \infty} f_n$ and
$\limsup_{n \uparrow \infty} f_n$ are measurable functions, and the set


$$\left\{ \limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n \right\} =: \left\{ \exists \lim_{n \uparrow \infty} f_n \right\}$$


belongs to $\mathcal{X}$. In particular, if $\{\exists \lim_{n \uparrow \infty} f_n\} = X$, the function $\lim_{n \uparrow \infty} f_n$ is a
measurable function.


Proof. We first prove the result in the particular case when the sequence of
functions is non-decreasing. Denote by $f$ the limit of this sequence. By Theorem
4.1.8 it suffices to show that $\{f \le a\} \in \mathcal{X}$ for all $a \in \mathbb{R}$. But since the sequence
$\{f_n\}_{n \ge 1}$ is non-decreasing, we have that $\{f \le a\} = \bigcap_{n=1}^{\infty} \{f_n \le a\}$, which is indeed
in $\mathcal{X}$, being a countable intersection of sets in $\mathcal{X}$.


Now recall that, by definition,


$$\liminf_{n \uparrow \infty} f_n = \lim_{n \uparrow \infty} g_n,$$

where $g_n = \inf_{k \ge n} f_k$. The function $g_n$ is measurable since for all $a \in \mathbb{R}$,
$\{\inf_{k \ge n} f_k < a\}$ is a measurable set, being the complement of
$\{\inf_{k \ge n} f_k \ge a\} = \bigcap_{k \ge n} \{f_k \ge a\}$, a measurable set, being the countable
intersection of measurable sets. Since the sequence $\{g_n\}_{n \ge 1}$ is non-decreasing,
the measurability of $\liminf_{n \uparrow \infty} f_n$ follows from the particular case of
non-decreasing functions.


Similarly, $\limsup_{n \uparrow \infty} f_n = -\liminf_{n \uparrow \infty} (-f_n)$ is measurable.


The set $\{\limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n\}$ is the set on which two measurable
functions are equal, and therefore, by the last assertion of Corollary 4.1.13, it is a
measurable set.


Finally, if $\lim_{n \uparrow \infty} f_n$ exists, it is equal to $\limsup_{n \uparrow \infty} f_n$, which is, as we just
proved, a measurable function.


The results above give substance to the assertion that “basically all functions 
are measurable”. However, beware! One can prove (not in this book) that there 
exist functions from R to R that are not measurable with respect to B(R). Hence 
all the fuss. In fact, there are subsets of R that are not in B(R). 


The basis of the construction of the Lebesgue integral is the following funda- 
mental approximation theorem. 


Theorem 4.1.15 Let $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ be a non-negative measurable function.
There exists a non-decreasing sequence $\{f_n\}_{n \ge 1}$ of non-negative simple measurable
functions that converges pointwise to $f$.

Proof. Take

$$f_n(x) = \sum_{k=0}^{n 2^n - 1} k 2^{-n} 1_{A_{k,n}}(x) + n 1_{A_n}(x),$$

where


$$A_{k,n} = \{x \in X : k 2^{-n} < f(x) \le (k+1) 2^{-n}\}, \qquad A_n = \{x \in X : f(x) > n\}.$$


This sequence of functions has the announced properties. In fact, for any $x \in X$
such that $f(x) < \infty$, and $n$ large enough,


$$|f(x) - f_n(x)| \le 2^{-n},$$


and for any $x \in X$ such that $f(x) = \infty$, $f_n(x) = n$ indeed converges to $f(x) = +\infty$.
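The dyadic construction used in this proof is easy to evaluate numerically. The sketch below is only an illustration: it takes the value $f(x) \in [0, +\infty]$ as input and returns $f_n(x)$, using the convention $k 2^{-n} < f(x) \le (k+1) 2^{-n}$ for the level sets, so that $k = \lceil f(x) 2^n \rceil - 1$. The function name is hypothetical.

```python
import math


def simple_approximation(value, n):
    """Value f_n(x) of the n-th dyadic simple function, given value = f(x) >= 0.

    - if f(x) > n (in particular f(x) = +inf), f_n(x) = n;
    - if f(x) = 0, f_n(x) = 0;
    - otherwise f_n(x) = k 2^-n, where k 2^-n < f(x) <= (k + 1) 2^-n.
    """
    if value < 0:
        raise ValueError("f must be non-negative")
    if value > n:
        return float(n)
    if value == 0:
        return 0.0
    k = math.ceil(value * 2 ** n) - 1
    return k / 2 ** n
```

For $f(x) = 0.3$ the successive values for $n = 1, \ldots, 5$ are 0, 0.25, 0.25, 0.25, 0.28125, a non-decreasing sequence within $2^{-n}$ of $f(x)$; for $f(x) = +\infty$ the value is $n$.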


Measure 


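Definition 4.1.16 Let $(X, \mathcal{X})$ be a measurable space and let $\mu : \mathcal{X} \to [0, \infty]$ be a
set function such that $\mu(\emptyset) = 0$ and such that for any countable family $\{A_n\}_{n \ge 1}$
of mutually disjoint sets in $\mathcal{X}$,


$$\mu\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} \mu(A_n). \qquad (4.2)$$


The set function $\mu$ is called a measure on $(X, \mathcal{X})$, and $(X, \mathcal{X}, \mu)$ is called a
measure(d) space.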


Property (4.2) is the sigma-additivity property. 


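EXAMPLE 4.1.17: THE DIRAC MEASURE. Let $a \in X$ and let $\mathcal{X}$ be an arbitrary
σ-field on $X$. The measure $\varepsilon_a$ defined on $(X, \mathcal{X})$ by $\varepsilon_a(C) = 1_C(a)$ is called the
Dirac measure at $a \in X$. The set function $\mu : \mathcal{X} \to [0, \infty]$ defined by

$$\mu(C) := \sum_{i=0}^{\infty} \alpha_i 1_C(a_i),$$

where $a_i \in X$ and $\alpha_i \in \mathbb{R}_+$ for all $i \in \mathbb{N}$, is a measure on $(X, \mathcal{X})$ denoted by $\sum_{i=0}^{\infty} \alpha_i \varepsilon_{a_i}$.


EXAMPLE 4.1.18: WEIGHTED COUNTING MEASURE. Let $\{\alpha_n\}_{n \in \mathbb{Z}}$ be a sequence
of non-negative numbers. The set function $\mu : \mathcal{P}(\mathbb{Z}) \to [0, \infty]$ defined by $\mu(C) :=
\sum_{n \in C} \alpha_n$ is a measure on $(\mathbb{Z}, \mathcal{P}(\mathbb{Z}))$. When $\alpha_n \equiv 1$, it is called the counting
measure on $\mathbb{Z}$.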
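These two examples are simple enough to be mimicked on a computer, at least for finite sets. In the Python sketch below, which is an illustration and not a general implementation, a set $C$ is a finite Python set, a measure is a function mapping such a set to a non-negative number, and the weights of the counting measure are given by a dictionary with finitely many entries (missing integers are assumed to have weight 0). The function names are hypothetical.

```python
def dirac(a):
    """Dirac measure at a: C -> 1 if a belongs to C, 0 otherwise."""
    return lambda c: 1.0 if a in c else 0.0


def weighted_counting(weights):
    """Weighted counting measure C -> sum of weights[n] over n in C.

    `weights` maps integers to non-negative numbers; integers that are
    not keys of the dictionary are given weight 0 (finite-support assumption).
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    return lambda c: sum(weights.get(n, 0.0) for n in c)
```

Additivity on disjoint sets can be observed directly: with `mu = weighted_counting({1: 0.5, 2: 1.5, 5: 2.0})`, the values `mu({1, 2})` and `mu({5})` add up to `mu({1, 2, 5})`.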


EXAMPLE 4.1.19: THE LEBESGUE MEASURE. The measure $\ell$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such
that
$$\ell((a, b]) = b - a$$


is called the Lebesgue measure on R. Measure theory tells us that there exists one 
and only one such measure. (See the more general result below, Theorem 4.1.27.) 


The proofs of existence and uniqueness of measures are in general not given. 
They are usually very technical and tedious, and their omission has no bearing in 
the rest of the book. See Theorem 4.1.27 and the comment following it. 


Definition 4.1.20 Let $\mu$ be a measure on $(X, \mathcal{X})$. If $\mu(X) < \infty$ the measure $\mu$ is
called a finite measure. If $\mu(X) = 1$ the measure $\mu$ is called a probability measure.
If there exists a sequence $\{K_n\}_{n \ge 1}$ of sets in $\mathcal{X}$ such that $\mu(K_n) < \infty$ for all $n \ge 1$, and
$\bigcup_{n=1}^{\infty} K_n = X$, the measure $\mu$ is called a sigma-finite measure. A measure $\mu$ on
$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ such that $\mu(C) < \infty$ for all bounded sets $C$ in $\mathcal{B}(\mathbb{R}^n)$ is called a locally
finite measure.


For instance, the Dirac measure $\varepsilon_a$ is a probability measure, the counting measure
$\nu$ on $\mathbb{Z}$ is a sigma-finite measure, the Lebesgue measure is a locally finite
measure, and any locally finite measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is sigma-finite.


The following result is the sequential continuity theorem for measures. 


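Theorem 4.1.21 Let $(X, \mathcal{X}, \mu)$ be a measure space. Let $\{A_n\}_{n \ge 1}$ be a sequence of $\mathcal{X}$, non-decreasing (that is, $A_n \subseteq A_{n+1}$ for all $n \ge 1$). Then

$$\mu\Big(\bigcup_{n \ge 1} A_n\Big) = \lim_{n \uparrow \infty} \mu(A_n). \tag{4.3}$$
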
Proof. The proof is the same as that of Theorem 1.2.8. 


μ-negligible sets


The notion of a negligible set of Definition 1.2.10 will be repeated in a more general 
setting. 


Definition 4.1.22 Let $(X, \mathcal{X}, \mu)$ be a measure space. A $\mu$-negligible set is a set contained in a measurable set $N \in \mathcal{X}$ such that $\mu(N) = 0$. One says that some property $\mathcal{P}$ relative to the elements $x \in X$ holds $\mu$-almost everywhere ($\mu$-a.e.) if the set $\{x \in X : x \text{ does not satisfy } \mathcal{P}\}$ is a $\mu$-negligible set.


For instance, if $f$ and $g$ are two measurable functions defined on $X$, the expression

$$f \le g \quad \mu\text{-a.e.}$$

means that $\mu(\{x : f(x) > g(x)\}) = 0$.


Theorem 4.1.23 A countable union of $\mu$-negligible sets is a $\mu$-negligible set.


Proof. Same proof as for Theorem 1.2.11. 


EXAMPLE 4.1.24: CONTINUOUS FUNCTIONS. We show that two continuous functions $f, g : \mathbb{R} \to \mathbb{R}$ that are $\ell$-a.e. equal are in fact everywhere equal.


Proof. Suppose, to the contrary, that $f(t) \ne g(t)$ for some $t \in \mathbb{R}$. For any $c > 0$, there exists an $s \in [t - c, t + c]$ such that $f(s) = g(s)$. (Otherwise, the set $\{u ;\, f(u) \ne g(u)\}$ would contain the whole interval $[t - c, t + c]$, and therefore could not be of null Lebesgue measure.) Therefore, one can construct a sequence $\{t_n\}_{n \ge 1}$ converging to $t$ and such that $f(t_n) = g(t_n)$ for all $n \ge 1$. Letting $n$ tend to $\infty$ yields, by continuity, $f(t) = g(t)$, a contradiction.


Cumulative Distribution Function 


Definition 4.1.25 A function $F : \mathbb{R} \to \mathbb{R}$ is called a cumulative distribution function (CDF) if the following properties are satisfied:


1. $F$ is non-decreasing;
2. $F$ is right-continuous;
3. $F$ admits a left-hand limit, denoted by $F(x-)$, at all $x \in \mathbb{R}$.


EXAMPLE 4.1.26: THE CDF OF A MEASURE. Let $\mu$ be a locally finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and define

$$F_\mu(t) = \begin{cases} +\mu((0, t]) & \text{if } t \ge 0, \\ -\mu((t, 0]) & \text{if } t < 0. \end{cases}$$

This is a cumulative distribution function (CDF), and moreover,

$$F_\mu(b) - F_\mu(a) = \mu((a, b]), \qquad F_\mu(a) - F_\mu(a-) = \mu(\{a\}).$$

The proof that this function is indeed a CDF follows the same lines as the proof of Theorem 3.1.4. The function $F_\mu$ is called the CDF of $\mu$.
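The two identities of Example 4.1.26 can be checked by hand on a purely atomic measure. The sketch below assumes a hypothetical measure with four atoms (the dictionary `atoms`); the function names are illustrative and not taken from the text.

```python
from fractions import Fraction

# Hypothetical purely atomic measure on R: atoms[x] is the mass at the point x.
atoms = {-2: Fraction(1, 2), 0: Fraction(1), 1: Fraction(3, 2), 3: Fraction(2)}

def mu_interval(lo, hi, closed_right=True):
    """mu((lo, hi]) if closed_right, else mu((lo, hi))."""
    return sum(m for x, m in atoms.items()
               if lo < x and (x <= hi if closed_right else x < hi))

def F(t):
    """CDF of mu: +mu((0, t]) for t >= 0 and -mu((t, 0]) for t < 0."""
    return mu_interval(0, t) if t >= 0 else -mu_interval(t, 0)

def F_left(t):
    """Left-hand limit F(t-): mu((0, t)) for t > 0, -mu([t, 0]) for t <= 0."""
    if t > 0:
        return mu_interval(0, t, closed_right=False)
    return -mu_interval(t, 0) - atoms.get(t, 0)

increment = F(1) - F(-2)        # should equal mu((-2, 1])
jump_at_0 = F(0) - F_left(0)    # should equal mu({0})
```

The jump of $F_\mu$ at a point is the mass of the atom there, which is why `F_left` has to treat the endpoint separately.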


Theorem 4.1.27 Let $F : \mathbb{R} \to \mathbb{R}$ be a CDF. There exists a unique locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $F_\mu = F$.

The last result is easily stated, but it is not trivial, even in the case of the 
Lebesgue measure (Example 4.1.19). It is typical of the existence and uniqueness 
results which answer the following type of question: 


Let $\mathcal{C}$ be a collection of subsets of $X$ with $\mathcal{C} \subseteq \mathcal{X}$, where $\mathcal{X}$ is a $\sigma$-field on $X$. Given a set function $u : \mathcal{C} \to [0, \infty]$, does there exist a measure $\mu$ on $(X, \mathcal{X})$ such that $\mu(C) = u(C)$ for all $C \in \mathcal{C}$, and is it unique? As mentioned in the introduction, this issue will not be treated in this book.²


However, we shall now quote a fundamental result that we shall need in the 
chapter on martingales (Chapter 8). 


Carathéodory's Theorem


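Definition 4.1.28 Let $X$ be a set. The collection $\mathcal{A} \subseteq \mathcal{P}(X)$ is called an algebra if

(α) $X \in \mathcal{A}$;
(β) $A, B \in \mathcal{A} \implies A \cup B \in \mathcal{A}$;
(γ) $A \in \mathcal{A} \implies \overline{A} \in \mathcal{A}$.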


The only difference with a $\sigma$-field is that we require it to be closed under finite (instead of countable) unions. (This is why a $\sigma$-field is also called a $\sigma$-algebra.) Note that, similarly to the $\sigma$-field case, $\emptyset \in \mathcal{A}$ and $\mathcal{A}$ is closed under finite intersections.


EXAMPLE 4.1.29: FINITE UNIONS OF DISJOINT INTERVALS. On $\mathbb{R}$, the collection of finite unions of disjoint intervals is an algebra. (By interval, we mean any type of interval: open, closed, semi-open, semi-closed, infinite, etc., in other words a connected subset of $\mathbb{R}$.³)


2 See [1], [3], or [11].
3 A subset $C$ of $\mathbb{R}$ is called connected if for all $a, b \in C$, the segment $[a, b] \subseteq C$.


Definition 4.1.30 Let $X$ be a set. The collection $\mathcal{C} \subseteq \mathcal{P}(X)$ is called a semi-algebra if

(α) $X \in \mathcal{C}$,
(β) $A, B \in \mathcal{C} \implies A \cap B \in \mathcal{C}$, and
(γ) when $A \in \mathcal{C}$, $\overline{A}$ can be expressed as a finite union of disjoint sets of $\mathcal{C}$.


EXAMPLE 4.1.31: THE COLLECTION OF INTERVALS. On R, the collection of 
intervals is a semi-algebra. 


Theorem 4.1.32 Let $\mathcal{C}$ be either an algebra or a semi-algebra defined on $X$. Let $\mu$ be a $\sigma$-finite measure on $(X, \mathcal{C})$. Then there exists a unique extension of $\mu$ to $(X, \sigma(\mathcal{C}))$ that is a measure.


The proof is omitted.⁴


4.2 The Integral 


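We are now in a position to define (when it exists) the Lebesgue integral of a measurable function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with respect to a measure $\mu$. This integral will be denoted by

$$\int_X f\, d\mu, \quad \text{or} \quad \int_X f(x)\, \mu(dx), \quad \text{or} \quad \mu(f).$$

Let $S^+(X)$ (or $S^+$ if the context is clear) be the set of non-negative simple functions $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, and let $M^+(X)$ (or $M^+$) be the set of non-negative measurable functions $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.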


The integral is defined in three steps: first for simple functions, where the definition imposes itself; then for non-negative measurable functions, by a natural limiting procedure involving the approximation theorem (Theorem 4.1.15); and finally for (some) functions of arbitrary sign, by considering their negative and positive parts.


STEP 1. One first defines the integral for integrands in $S^+$. Let $f : X \to \mathbb{R}$ be a non-negative simple Borel function $f = \sum_{i=1}^{k} a_i 1_{A_i}$ as in Definition 4.1.7. The integral of $f$ with respect to $\mu$ is defined by

$$\int_X f\, d\mu := \sum_{i=1}^{k} a_i\, \mu(A_i). \tag{4.4}$$

In order to check that this definition does not depend on the representation of $f$, one must show that if $f$ admits another representation

$$f(x) = \sum_{j=1}^{m} b_j\, 1_{B_j}(x),$$

where $m \in \mathbb{N}_+$, $b_1, \ldots, b_m \in \mathbb{R}$, and $B_1, \ldots, B_m$ are sets in $\mathcal{X}$, then

$$\sum_{j=1}^{m} b_j\, \mu(B_j) = \sum_{i=1}^{k} a_i\, \mu(A_i). \tag{4.5}$$


4 See for instance [12], Theorem 1.41.


This verification is easy and left to the reader.
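The independence of the integral of a simple function from its representation can be illustrated on a finite space. This is a sketch under stated assumptions: the space $X = \{0, 1, 2, 3\}$, the point masses in `weights` and the two representations `rep1`, `rep2` are made up for the illustration.

```python
from fractions import Fraction

def integral_simple(terms, mu):
    """Integral of f = sum_i a_i 1_{A_i}; terms is a list of pairs (a_i, A_i)."""
    return sum(a * mu(A) for a, A in terms)

# Hypothetical weighted counting measure on the finite space X = {0, 1, 2, 3}.
weights = {0: Fraction(1, 2), 1: Fraction(1), 2: Fraction(2), 3: Fraction(1, 4)}
mu = lambda A: sum(weights[x] for x in A)

# Two representations of the same function, with values (3, 3, 5, 2) on X.
rep1 = [(3, {0, 1}), (5, {2}), (2, {3})]              # disjoint sets
rep2 = [(2, {0, 1, 2, 3}), (1, {0, 1, 2}), (2, {2})]  # overlapping sets

value1 = integral_simple(rep1, mu)
value2 = integral_simple(rep2, mu)
```

Both representations give the same value, as the consistency requirement on the definition demands.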


The next lemma collects a few intermediary results. 


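Lemma 4.2.1 Let $f, f_1, f_2, \ldots$ be in $S^+$. Then
(a) for all $\lambda \ge 0$, $\lambda f \in S^+$ and $\int_X (\lambda f)\, d\mu = \lambda \int_X f\, d\mu$,
(b) $f_1 + f_2 \in S^+$ and $\int_X (f_1 + f_2)\, d\mu = \int_X f_1\, d\mu + \int_X f_2\, d\mu$,
(c) $f_1 \le f_2$ implies $\int_X f_1\, d\mu \le \int_X f_2\, d\mu$,
(d) $f_1 \wedge f_2$ and $f_1 \vee f_2$ are in $S^+$, and
(e) if $f_n \le f_{n+1} \le f$ for all $n \ge 1$ and $\lim_{n \uparrow \infty} f_n = f$, then $\lim_{n \uparrow \infty} \int_X f_n\, d\mu = \int_X f\, d\mu$.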


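Proof. Properties (a)–(d) are immediate. For (e), first consider the case $f = 1_A$. Fix $m > 1$. For all $n \ge 1$, define $A_{n,m} = \{x : f_n(x) > 1 - \frac{1}{m}\}$. Since $\{f_n\}_{n \ge 1}$ is non-decreasing, we have that $A_{n,m} \subseteq A_{n+1,m}$; and since $f_n \uparrow 1_A$, we have that $\bigcup_{n \ge 1} A_{n,m} = A$. Note that

$$\Big(1 - \frac{1}{m}\Big) 1_{A_{n,m}} \le f_n \le 1_A,$$

and therefore

$$\Big(1 - \frac{1}{m}\Big) \mu(A_{n,m}) \le \int_X f_n\, d\mu \le \mu(A),$$

from which the announced result follows in this particular case by first letting $n \uparrow +\infty$ (using the sequential continuity of $\mu$, Theorem 4.1.21) and then $m \uparrow +\infty$.

Consider now the general case where $f = \sum_{i=1}^{k} a_i 1_{A_i}$. We may suppose that $\{A_i\}_{i=1}^{k}$ is a partition of $X$, so that $f_n = \sum_{i=1}^{k} f_n 1_{A_i}$, and therefore, by (a) and (b), discarding the terms with $a_i = 0$ (for which $f_n 1_{A_i} = 0$ since $0 \le f_n \le f$),

$$\int_X f_n\, d\mu = \sum_{i=1}^{k} \int_X f_n 1_{A_i}\, d\mu = \sum_{i=1}^{k} a_i \int_X \frac{f_n}{a_i} 1_{A_i}\, d\mu.$$

Passing to the limit $n \uparrow \infty$ yields the desired result, since $\frac{f_n}{a_i} 1_{A_i} \uparrow 1_{A_i}$.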


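STEP 2. The integral will now be defined for integrands $f \in M^+$. For such $f$, let

$$\mu(f) := \sup \Big\{ \int_X \varphi\, d\mu \,;\; \varphi \in S^+,\ \varphi \le f \Big\}.$$

The function $f$ is called $\mu$-integrable if $\mu(f) < \infty$.

We first check that if $f \in S^+$, $\mu(f) = \int_X f\, d\mu$. For this, let

$$A_f := \Big\{ \int_X \varphi\, d\mu \,;\; \varphi \in S^+,\ \varphi \le f \Big\}.$$

Since $f \in S^+$, $\int_X f\, d\mu \in A_f$ and therefore $\mu(f) \ge \int_X f\, d\mu$. On the other hand, for all $\varphi \in S^+$ such that $\varphi \le f$, $\int_X \varphi\, d\mu \le \int_X f\, d\mu$ (by (c) of Lemma 4.2.1) and therefore $\mu(f) \le \int_X f\, d\mu$. Therefore $\mu(f) = \int_X f\, d\mu$.


Having checked this point, it is now safe to call $\mu(f)$ the integral of $f$ with respect to $\mu$, and denote it also by $\int_X f\, d\mu$. Indeed the two ways of defining $\int_X f\, d\mu$ for $f \in S^+$ (as in Step 1 and Step 2) give the same result.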
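The supremum of Step 2 can be approached along an explicit non-decreasing sequence of simple functions. The sketch below assumes the usual dyadic approximations $\varphi_n = 2^{-n} \lfloor 2^n f \rfloor$ (a standard choice, not necessarily the one used in Theorem 4.1.15) for $f(x) = x$ on $[0, 1]$ with the Lebesgue measure; the function name is illustrative.

```python
from fractions import Fraction

def integral_dyadic_approx(n):
    """Integral over [0, 1] (Lebesgue measure) of phi_n = floor(2^n f) / 2^n, f(x) = x.

    phi_n equals k / 2^n on [k / 2^n, (k + 1) / 2^n), an interval of length 2^-n,
    so the integral is the finite sum of (k / 2^n) * 2^-n over k = 0, ..., 2^n - 1.
    """
    step = Fraction(1, 2 ** n)
    return sum(k * step * step for k in range(2 ** n))

values = [integral_dyadic_approx(n) for n in range(1, 8)]
```

The values increase with $n$ and stay below $1/2$, the value of the integral of $f$; the gap is $2^{-(n+1)}$.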


The next result, due to Beppo Levi, is the monotone convergence theorem. 


Theorem 4.2.2 Let $\{f_n\}_{n \ge 1}$ be a non-decreasing sequence of non-negative measurable functions from $X$ to $\mathbb{R}$. Then

$$\lim_{n \uparrow \infty} \int_X f_n\, d\mu = \int_X \Big(\lim_{n \uparrow \infty} f_n\Big)\, d\mu.$$


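Proof. We shall use the following monotonicity property: if $f_1, f_2$ in $M^+$ are such that $f_1 \le f_2$, then $\int_X f_1\, d\mu \le \int_X f_2\, d\mu$. In fact, $A_{f_1} \subseteq A_{f_2}$, and therefore

$$\mu(f_1) = \sup A_{f_1} \le \sup A_{f_2} = \mu(f_2).$$

We now turn to the proof of the theorem. Denote $\lim_{n \uparrow \infty} f_n$ by $f$. By the just proved monotonicity property of the integral of functions in $M^+$,

$$\int_X f_n\, d\mu \le \int_X f_{n+1}\, d\mu \le \int_X f\, d\mu.$$

Being a non-decreasing sequence, $\{\int_X f_n\, d\mu\}_{n \ge 1}$ has a limit, and by the previous inequality

$$\lim_{n \uparrow \infty} \int_X f_n\, d\mu \le \int_X f\, d\mu.$$

It remains to prove the converse inequality. For $\varphi \in S^+$ such that $\varphi \le f$, $\lambda \in (0, 1)$ and $n \ge 1$, define the (measurable) set $E_n = \{f_n \ge \lambda \varphi\}$. We have that $E_n \subseteq E_{n+1}$. Moreover $\bigcup_{n \ge 1} E_n = X$. Since $\lambda \varphi 1_{E_n} \le f_n$,

$$\int_X \lambda \varphi 1_{E_n}\, d\mu \le \int_X f_n\, d\mu \le \lim_{k \uparrow \infty} \int_X f_k\, d\mu.$$

On the other hand, since $E_n \subseteq E_{n+1}$ and $\bigcup_{n \ge 1} E_n = X$, we have that $1_{E_n} \uparrow 1$ and in particular $1_{E_n} \varphi \uparrow \varphi$. Therefore by (e) of Lemma 4.2.1, $\lim_{n \uparrow \infty} \int_X \lambda \varphi 1_{E_n}\, d\mu = \int_X \lambda \varphi\, d\mu$. Passing to the limit $n \uparrow \infty$ in the last displayed inequalities, we have that for all $\lambda \in (0, 1)$,

$$\lambda \int_X \varphi\, d\mu \le \lim_{n \uparrow \infty} \int_X f_n\, d\mu.$$

This inequality remains true at the limit $\lambda = 1$. This being true of all $\varphi \in S^+$ such that $\varphi \le f$, we have

$$\int_X f\, d\mu \le \lim_{n \uparrow \infty} \int_X f_n\, d\mu.$$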


Here is another collection of intermediary results that we group in a lemma for 
later reference: 


Lemma 4.2.3 Let $f, f_1, f_2$ be in $M^+$. Then
(i) for all $\lambda \ge 0$, $\int_X (\lambda f)\, d\mu = \lambda \int_X f\, d\mu$,
(ii) $\int_X (f_1 + f_2)\, d\mu = \int_X f_1\, d\mu + \int_X f_2\, d\mu$, and
(iii) if $f_1 \le f_2$, then $\int_X f_1\, d\mu \le \int_X f_2\, d\mu$.

Proof. (iii) was obtained in the proof of Theorem 4.2.2. Properties (i) and (ii) are satisfied for functions in $S^+$ (Lemma 4.2.1). Using non-decreasing sequences of functions in $S^+$, $\{f_{1,n}\}_{n \ge 1}$ and $\{f_{2,n}\}_{n \ge 1}$, converging respectively to $f_1$ and $f_2$, we have that for all $n \ge 1$,

$$\int_X (\lambda f_{1,n})\, d\mu = \lambda \int_X f_{1,n}\, d\mu$$

and

$$\int_X (f_{1,n} + f_{2,n})\, d\mu = \int_X f_{1,n}\, d\mu + \int_X f_{2,n}\, d\mu.$$

Letting $n \uparrow \infty$, the monotone convergence theorem (Theorem 4.2.2) yields (i) and (ii).


The next result is a fundamental technical tool, called Fatou’s lemma: 


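Theorem 4.2.4 Let $\{f_n\}_{n \ge 1}$ be a sequence of non-negative measurable functions from $X$ to $\mathbb{R}_+$. Then

$$\int_X \Big(\liminf_{n \uparrow \infty} f_n\Big)\, d\mu \le \liminf_{n \uparrow \infty} \int_X f_n\, d\mu.$$


Proof. Define $f := \liminf_{n \uparrow \infty} f_n = \lim_{n \uparrow \infty} (\inf_{k \ge n} f_k)$. By the monotone convergence theorem (Theorem 4.2.2) for the second equality, we obtain

$$\int_X f\, d\mu = \int_X \lim_{n \uparrow \infty} \Big(\inf_{k \ge n} f_k\Big)\, d\mu = \lim_{n \uparrow \infty} \int_X \Big(\inf_{k \ge n} f_k\Big)\, d\mu.$$

On the other hand, since for all $i \ge n$, $\int_X (\inf_{k \ge n} f_k)\, d\mu \le \int_X f_i\, d\mu$, we have that $\int_X (\inf_{k \ge n} f_k)\, d\mu \le \inf_{i \ge n} \big(\int_X f_i\, d\mu\big)$. Therefore

$$\int_X f\, d\mu \le \lim_{n \uparrow \infty} \inf_{i \ge n} \Big(\int_X f_i\, d\mu\Big) = \liminf_{n \uparrow \infty} \int_X f_n\, d\mu.$$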


STEP 3. Integrals of functions of arbitrary sign. 


Definition 4.2.5 A measurable function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ satisfying

$$\int_X |f|\, d\mu < \infty$$

is called a $\mu$-integrable function.


Define $f^+ := \max(f, 0)$ and $f^- := \max(-f, 0)$. In particular, $f = f^+ - f^-$ and $f^\pm \le |f|$. Therefore, by monotonicity (Property (iii) of Lemma 4.2.3), $\int_X f^\pm\, d\mu \le \int_X |f|\, d\mu$. Thus, if $f$ is integrable, the right-hand side of

$$\int_X f\, d\mu := \int_X f^+\, d\mu - \int_X f^-\, d\mu \tag{4.6}$$

is meaningful (no $-\infty + \infty$ form) and defines the integral on the left-hand side. Moreover, the integral of $f$ with respect to $\mu$ defined in this way is finite.


The integral can be defined for some non-integrable measurable functions, for instance, as we have seen, for all measurable non-negative functions. More generally, if $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is such that at least one of the integrals $\int_X f^+\, d\mu$ or $\int_X f^-\, d\mu$ is finite, one defines the integral as in (4.6). This leads to one of the forms “finite minus finite”, “finite minus infinite”, and “infinite minus finite”. The case which is rigorously excluded is that in which $\mu(f^+) = \mu(f^-) = +\infty$.


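Let $A \in \mathcal{X}$. The following equality is a definition of the left-hand side provided the right-hand side is well defined:

$$\int_A f(x)\, \mu(dx) := \int_X 1_A(x) f(x)\, \mu(dx).$$

For a complex Borel function $f : X \to \mathbb{C}$ (that is, $f = f_1 + i f_2$, where $f_1, f_2 : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$) such that $\mu(|f|) < \infty$, let

$$\int_X f\, d\mu := \int_X f_1\, d\mu + i \int_X f_2\, d\mu.$$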


EXAMPLE 4.2.6: INTEGRAL WITH RESPECT TO THE DIRAC MEASURE. Let $(X, \mathcal{X})$ be an arbitrary measurable space and let $\varepsilon_a$ be the Dirac measure at a point $a \in X$. Let $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. We shall prove formally that it is $\varepsilon_a$-integrable and that

$$\varepsilon_a(f) = f(a).$$

For a simple function $f$ as in (4.1), we have

$$\varepsilon_a(f) = \sum_{i=1}^{k} a_i\, \varepsilon_a(A_i) = \sum_{i=1}^{k} a_i\, 1_{A_i}(a) = f(a).$$

For a non-negative function $f$, and any non-decreasing sequence of simple non-negative measurable functions $\{f_n\}_{n \ge 1}$ converging to $f$, we have

$$\varepsilon_a(f) = \lim_{n \uparrow \infty} \varepsilon_a(f_n) = \lim_{n \uparrow \infty} f_n(a) = f(a).$$

Finally, for any $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,

$$\varepsilon_a(f) = \varepsilon_a(f^+) - \varepsilon_a(f^-) = f^+(a) - f^-(a) = f(a)$$

is a well defined quantity.

What we have finally obtained is the formula

$$\int_X f(x)\, \varepsilon_a(dx) = f(a).$$

The aficionados of the so-called Dirac “function” $\delta$ like to write the left-hand side

$$\int_X f(x)\, \delta_a(x)\, dx \quad \text{or} \quad \int_X f(x)\, \delta(x - a)\, dx.$$

EXAMPLE 4.2.7: LEBESGUE-INTEGRABLE BUT NOT RIEMANN-INTEGRABLE. 

The function $f$ defined by $f := 1_{\mathbb{Q}}$ ($\mathbb{Q}$ is the set of rational numbers) is a
Borel function and it is Lebesgue integrable with its integral equal to zero because 
{ f 4 0} is the set of rational numbers, which has null Lebesgue measure. However, 
f is not Riemann integrable. 


We finally define the Stieltjes–Lebesgue integral.


Definition 4.2.8 Let $F$ be a cumulative distribution function on $\mathbb{R}$ and let $\mu_F$ be the associated locally finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ (see Example 4.1.26). By definition, the Stieltjes–Lebesgue integral of the measurable function $g : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with respect to $F$ is the integral of $g$ with respect to $\mu_F$. It is denoted by $\int_{\mathbb{R}} g(x)\, dF(x)$. Therefore

$$\int_{\mathbb{R}} g(x)\, dF(x) := \int_{\mathbb{R}} g(x)\, \mu_F(dx).$$


4.3. Basic Properties of the Integral 
We first state and prove one of the most important results of integration theory, 
the Lebesgue theorem, also called the dominated convergence theorem. 


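Theorem 4.3.1 Let $\{f_n\}_{n \ge 1}$ be a sequence of measurable functions from $(X, \mathcal{X})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that converges to a (necessarily) measurable function $f$. Suppose moreover that for all $n \ge 1$, $|f_n| \le g$, where $g$ is integrable. Then

$$\lim_{n \uparrow \infty} \int_X f_n\, d\mu = \int_X \Big(\lim_{n \uparrow \infty} f_n\Big)\, d\mu.$$

Proof. By Fatou's lemma applied to the sequence of non-negative functions $\{g + f_n\}_{n \ge 1}$,

$$\int_X (g + f)\, d\mu = \int_X \liminf_{n \uparrow \infty} (g + f_n)\, d\mu \le \liminf_{n \uparrow \infty} \int_X (g + f_n)\, d\mu = \int_X g\, d\mu + \liminf_{n \uparrow \infty} \int_X f_n\, d\mu.$$

Therefore, since $\int_X g\, d\mu$ is finite,

$$\int_X f\, d\mu \le \liminf_{n \uparrow \infty} \int_X f_n\, d\mu.$$

Similarly, replacing $f$ and $f_n$ by $-f$ and $-f_n$, respectively,

$$\int_X f\, d\mu \ge \limsup_{n \uparrow \infty} \int_X f_n\, d\mu.$$

In particular, $\lim_{n \uparrow \infty} \int_X f_n\, d\mu$ exists and is equal to $\int_X f\, d\mu$.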


Recall that for all $A \in \mathcal{X}$,

$$\int_X 1_A\, d\mu = \mu(A) \tag{4.7}$$

by definition and that the notation $\int_A f\, d\mu$ stands for $\int_X 1_A f\, d\mu$.


Theorem 4.3.2 Let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be $\mu$-integrable functions, and let $a, b \in \mathbb{R}$. Then

(a) $af + bg$ is $\mu$-integrable and $\mu(af + bg) = a\mu(f) + b\mu(g)$,
(b) if $f = 0$ $\mu$-a.e., then $\mu(f) = 0$; if $f = g$ $\mu$-a.e., then $\mu(f) = \mu(g)$,
(c) if $f \le g$ $\mu$-a.e., then $\mu(f) \le \mu(g)$,
(d) $|\mu(f)| \le \mu(|f|)$,
(e) if $f \ge 0$ $\mu$-a.e. and $\mu(f) = 0$, then $f = 0$ $\mu$-a.e.,
(f) if $\mu(1_A f) = 0$ for all $A \in \mathcal{X}$, then $f = 0$ $\mu$-a.e., and
(g) if $f$ is $\mu$-integrable, then $|f| < \infty$ $\mu$-a.e.

Proof. The (easy) proofs of (a)–(c) are omitted.


(d) $\mu(f) = \mu(f^+) - \mu(f^-)$. Therefore $|\mu(f)| \le \mu(f^+) + \mu(f^-) = \mu(f^+ + f^-) = \mu(|f|)$.

(e) Define $A_n = \{f \ge \frac{1}{n}\}$. Since $f$ is non-negative, $f \ge \frac{1}{n} 1_{A_n}$ and therefore,

$$\mu(f) \ge \frac{1}{n}\, \mu(A_n),$$

from which it follows that, since $\mu(f) = 0$, $\mu(A_n) = 0$, and $\lim_{n \uparrow \infty} \mu(A_n) = 0$. But the sequence of sets $\{A_n\}_{n \ge 1}$ increases to $\{f > 0\}$ and therefore, by sequential continuity, $\mu(\{f > 0\}) = 0$, that is, $f \le 0$, $\mu$-a.e. On the other hand, by hypothesis, $f \ge 0$, $\mu$-a.e. Therefore $f = 0$, $\mu$-a.e.


(f) With $A = \{f > 0\}$, $1_A f$ is a non-negative measurable function. By (e), $1_A f = 0$, $\mu$-a.e. This implies that $1_A = 0$, $\mu$-a.e., that is to say $f \le 0$, $\mu$-a.e. Similarly, $f \ge 0$, $\mu$-a.e. Therefore, $f = 0$, $\mu$-a.e.


(g) It is enough to consider the case $f \ge 0$. Since $f \ge n 1_{\{f = \infty\}}$ for all $n \ge 1$, we have

$$\infty > \mu(f) \ge n\, \mu(\{f = \infty\}),$$

and therefore $n\, \mu(\{f = \infty\}) \le \mu(f) < \infty$. This cannot be true for all $n \ge 1$ unless

$$\mu(\{f = \infty\}) = 0.$$


The extension to complex Borel functions of the properties (a), (b), (d) and 
(f) is immediate. 


Beppo Levi, Fatou and Lebesgue 


The following versions of the theorems of Beppo Levi, Fatou and Lebesgue differ from the previous ones by the introduction of “$\mu$-almost everywhere” in the statements of the conditions. No other proofs are needed since integrals of almost everywhere equal functions are equal and countable unions of negligible sets are negligible. Only a convention must be stated: if the limit of a sequence of real measurable functions exists $\mu$-almost everywhere, that is, outside a $\mu$-negligible set, then the limit is typically assigned some arbitrary value on this $\mu$-negligible set.


Remember that we are looking for conditions guaranteeing that

$$\int_X \lim_{n \uparrow \infty} f_n\, d\mu = \lim_{n \uparrow \infty} \int_X f_n\, d\mu. \tag{4.8}$$


We start by restating the monotone convergence theorem.

Theorem 4.3.3 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ $(n \ge 1)$ be such that
(i) $f_n \ge 0$ $\mu$-a.e., and
(ii) $f_{n+1} \ge f_n$ $\mu$-a.e.
Then, there exists a non-negative function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$$\lim_{n \uparrow \infty} f_n = f \quad \mu\text{-a.e.},$$

and (4.8) holds true.


Next, we restate Fatou’s lemma. 


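Theorem 4.3.4 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ $(n \ge 1)$ be such that $f_n \ge 0$ $\mu$-a.e. $(n \ge 1)$. Then

$$\int_X \Big(\liminf_{n \uparrow \infty} f_n\Big)\, d\mu \le \liminf_{n \uparrow \infty} \Big(\int_X f_n\, d\mu\Big). \tag{4.9}$$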


Finally, we restate the Lebesgue or dominated convergence theorem. 
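Theorem 4.3.5 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ $(n \ge 1)$ be such that, for some function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and some $\mu$-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$:

(i) $\lim_{n \uparrow \infty} f_n = f$ $\mu$-a.e., and
(ii) $|f_n| \le |g|$ $\mu$-a.e. for all $n \ge 1$.

Then, (4.8) holds true.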


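EXAMPLE 4.3.6: THE CLASSICAL COUNTEREXAMPLE.
Let $(X, \mathcal{X}, \mu) = (\mathbb{R}, \mathcal{B}(\mathbb{R}), \ell)$, and let

$$f_n(x) = n\, 1_{(0, \frac{1}{n}]}(x).$$

One has $\lim_{n \uparrow \infty} f_n = 0$. Therefore $\mu(\lim_{n \uparrow \infty} f_n) = 0$. However, $\mu(f_n) = 1$ for all $n \ge 1$. Thus (4.8) fails: the sequence is neither monotone nor dominated by an integrable function.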
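The failure of (4.8) can be seen on a concrete sequence. The sketch below assumes the sequence $f_n = n\, 1_{(0, 1/n]}$, which has the properties stated in the example (pointwise limit zero, integral one); the function names are illustrative.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n on (0, 1/n], and 0 elsewhere."""
    return n if 0 < x <= Fraction(1, n) else 0

def integral_f(n):
    """Lebesgue integral of f_n: height n times the length 1/n of (0, 1/n]."""
    return n * Fraction(1, n)

# At a fixed point x > 0 the values f_n(x) vanish as soon as 1/n < x ...
x = Fraction(1, 100)
pointwise = [f(n, x) for n in (10, 100, 101, 1000)]
# ... while the integrals do not tend to the integral of the limit, which is 0.
integrals = [integral_f(n) for n in (1, 10, 100, 1000)]
```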


Differentiation under the Integral Sign 


Let $(X, \mathcal{X}, \mu)$ be a measure space and let $(a, b) \subseteq \mathbb{R}$. Let $f : (a, b) \times X \to \mathbb{R}$ and for all $t \in (a, b)$, define $f_t : X \to \mathbb{R}$ by $f_t(x) := f(t, x)$. Suppose that for all $t \in (a, b)$, $f_t$ is measurable with respect to $\mathcal{X}$, and define, when possible, the function $I : (a, b) \to \mathbb{R}$ by the formula

$$I(t) = \int_X f(t, x)\, \mu(dx). \tag{4.10}$$


Theorem 4.3.7 Assume that for $\mu$-almost all $x$ the function $t \mapsto f(t, x)$ is continuous at $t_0 \in (a, b)$ and that there exists a $\mu$-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $|f(t, x)| \le |g(x)|$ $\mu$-a.e. for all $t$ in a neighborhood $V$ of $t_0$. Then:


A. $I$ is well defined and is continuous at $t_0$.
B. We now assume in addition that
(α) $t \mapsto f(t, x)$ is continuously differentiable on $V$ for $\mu$-almost all $x$, and
(β) for some $\mu$-integrable function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and all $t \in V$, $|(df/dt)(t, x)| \le |h(x)|$ $\mu$-a.e.
Then $I$ is differentiable at $t_0$ and

$$I'(t_0) = \int_X (df/dt)(t_0, x)\, \mu(dx). \tag{4.11}$$


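Proof. A. Let $\{t_n\}_{n \ge 1}$ be a sequence in $V \setminus \{t_0\}$ such that $\lim_{n \uparrow \infty} t_n = t_0$, and define $f_n(x) = f(t_n, x)$, $f(x) = f(t_0, x)$. By dominated convergence,

$$\lim_{n \uparrow \infty} I(t_n) = \lim_{n \uparrow \infty} \mu(f_n) = \mu(f) = I(t_0).$$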


B. Let $\{t_n\}_{n \ge 1}$ be again a sequence in $V \setminus \{t_0\}$ such that $\lim_{n \uparrow \infty} t_n = t_0$. We have

$$\frac{I(t_n) - I(t_0)}{t_n - t_0} = \int_X \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0}\, \mu(dx),$$

and, by the mean value theorem, for some $\theta \in (0, 1)$, possibly depending upon $n$ and $x$,

$$\left| \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \right| \le \left| (df/dt)\big(t_0 + \theta (t_n - t_0), x\big) \right|.$$

The latter quantity is bounded by $|h(x)|$. Therefore, by dominated convergence,

$$\lim_{n \uparrow \infty} \frac{I(t_n) - I(t_0)}{t_n - t_0} = \int_X \lim_{n \uparrow \infty} \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0}\, \mu(dx) = \int_X (df/dt)(t_0, x)\, \mu(dx).$$
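Formula (4.11) can be checked numerically on a simple case. The sketch below assumes $X = (0, 1)$ with the Lebesgue measure and the integrand $f(t, x) = e^{-tx}$, for which $I(t) = (1 - e^{-t})/t$ and the derivative is dominated by $h(x) = x$; the quadrature helper `midpoint` and the point `t0` are illustrative choices.

```python
import math

def midpoint(func, lo, hi, n=2000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(func(lo + (i + 0.5) * step) for i in range(n))

def I(t):
    """I(t) = integral over (0, 1) of exp(-t x) dx, in closed form (t != 0)."""
    return (1.0 - math.exp(-t)) / t

t0 = 1.5
# Right-hand side of (4.11): integrate the derivative in t of the integrand.
rhs = midpoint(lambda x: -x * math.exp(-t0 * x), 0.0, 1.0)
# Left-hand side: symmetric difference quotient of I at t0.
eps = 1e-5
lhs = (I(t0 + eps) - I(t0 - eps)) / (2 * eps)
```

The two numbers agree up to the discretization errors of the quadrature and of the difference quotient.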


4.4 The Big Theorems 


The Image Measure Theorem 


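Definition 4.4.1 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, let $h : (X, \mathcal{X}) \to (E, \mathcal{E})$ be a measurable function, and let $\mu$ be a measure on $(X, \mathcal{X})$. The measure $\mu \circ h^{-1}$ on $(E, \mathcal{E})$, called the image of $\mu$ by $h$, is defined by

$$(\mu \circ h^{-1})(C) = \mu(h^{-1}(C)), \qquad C \in \mathcal{E}.$$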


(One easily checks that it is indeed a measure.) 


In the proof of the following theorem, the combination of the approximation 
theorem for measurable functions and of the monotone convergence theorem is 
typical. 

Theorem 4.4.2 For $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ an arbitrary non-negative measurable function,

$$\int_X (f \circ h)(x)\, \mu(dx) = \int_E f(y)\, (\mu \circ h^{-1})(dy). \tag{4.12}$$

For functions $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ of arbitrary sign either one of the conditions
(a) $f \circ h$ is $\mu$-integrable, or
(b) $f$ is $\mu \circ h^{-1}$-integrable,

implies the other, and equality (4.12) then holds.


Proof. The equality (4.12) is readily verified when $f$ is a non-negative simple measurable function. In the general case one approximates $f$ by a non-decreasing sequence of non-negative simple measurable functions $\{f_n\}_{n \ge 1}$ and (4.12) then follows from the same equality written with $f = f_n$, by letting $n \uparrow \infty$ and using the monotone convergence theorem. For the case of functions of arbitrary sign, apply (4.12) with $f^+$ and $f^-$.
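On a finite space, equality (4.12) is a regrouping of a finite sum. The following sketch assumes a hypothetical measure given by point masses on $X = \{-2, \ldots, 2\}$, the map $h(x) = x^2$ and the function $f(y) = 2y + 1$; all names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical measure mu on X = {-2, -1, 0, 1, 2}, given by its point masses.
mu = {-2: Fraction(1, 2), -1: Fraction(1), 0: Fraction(3),
      1: Fraction(1, 4), 2: Fraction(2)}
h = lambda x: x * x              # map h : X -> E = {0, 1, 4}
f = lambda y: 2 * y + 1          # non-negative function on E

def image_measure(mu, h):
    """Point masses of mu o h^{-1}: the mass at y is mu(h^{-1}({y}))."""
    out = defaultdict(Fraction)
    for x, m in mu.items():
        out[h(x)] += m
    return dict(out)

nu = image_measure(mu, h)
lhs = sum(f(h(x)) * m for x, m in mu.items())   # integral of f o h w.r.t. mu
rhs = sum(f(y) * m for y, m in nu.items())      # integral of f w.r.t. mu o h^{-1}
```

In probabilistic terms this is the computation of an expectation from the distribution of a random variable rather than on the underlying sample space.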


The Radon—Nikodym Theorem 


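Definition 4.4.3 Let $(X, \mathcal{X}, \mu)$ be a measure space and let $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a non-negative measurable function. Define the set function $\nu : \mathcal{X} \to [0, \infty]$ by

$$\nu(C) = \int_C h(x)\, \mu(dx).$$

Then $\nu$ is a measure on $(X, \mathcal{X})$ called the product of $\mu$ by the function $h$. This is denoted by $d\nu = h\, d\mu$.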


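That $\nu$ is a measure is easily checked. First of all, it is obvious that $\nu(\emptyset) = 0$. As for the $\sigma$-additivity property, write for any sequence of mutually disjoint measurable sets $\{A_n\}_{n \ge 1}$,

$$\begin{aligned}
\nu\Big(\bigcup_{n \ge 1} A_n\Big) &= \int_{\bigcup_{n \ge 1} A_n} h\, d\mu = \int_X 1_{\bigcup_{n \ge 1} A_n}\, h\, d\mu \\
&= \int_X \Big(\sum_{n \ge 1} 1_{A_n}\Big) h\, d\mu = \int_X \lim_{k \uparrow \infty} \Big(\sum_{n=1}^{k} 1_{A_n}\Big) h\, d\mu \\
&= \lim_{k \uparrow \infty} \int_X \Big(\sum_{n=1}^{k} 1_{A_n}\Big) h\, d\mu = \lim_{k \uparrow \infty} \sum_{n=1}^{k} \int_X 1_{A_n}\, h\, d\mu \\
&= \lim_{k \uparrow \infty} \sum_{n=1}^{k} \nu(A_n) = \sum_{n \ge 1} \nu(A_n),
\end{aligned}$$

where the fifth equality is by monotone convergence.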




Theorem 4.4.4 Let $\mu$, $h$ and $\nu$ be as in Definition 4.4.3.
(i) For non-negative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,

$$\int_X f(x)\, \nu(dx) = \int_X f(x) h(x)\, \mu(dx). \tag{4.13}$$

(ii) If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ has arbitrary sign, then either one of the following conditions
(a) $f$ is $\nu$-integrable,
(b) $fh$ is $\mu$-integrable,
implies the other, and the equality (4.13) then holds.


Proof. Verify (4.13) for elementary non-negative functions and, approximating 
f by a non-decreasing sequence of such functions, use the monotone convergence 
theorem as in the proof of (4.12). For the case of functions of arbitrary sign, apply 
(4.13) with f = f* and f = f-. 
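On a finite space, the product of a measure by a function and formula (4.13) reduce to elementary sums. The sketch below assumes hypothetical point masses `mu`, a density `h` and an integrand `f` on $X = \{0, 1, 2\}$, all chosen only for illustration.

```python
from fractions import Fraction

mu = {0: Fraction(1), 1: Fraction(2), 2: Fraction(1, 2)}   # point masses of mu
h = {0: Fraction(3), 1: Fraction(0), 2: Fraction(4)}       # density h >= 0
f = {0: Fraction(1, 3), 1: Fraction(7), 2: Fraction(5)}    # non-negative integrand

# nu(C) = integral over C of h dmu; on a finite space nu is given by point masses.
nu = {x: h[x] * mu[x] for x in mu}

lhs = sum(f[x] * nu[x] for x in mu)            # integral of f w.r.t. nu
rhs = sum(f[x] * h[x] * mu[x] for x in mu)     # integral of f h w.r.t. mu

# mu(C) = 0 forces nu(C) = 0, but not conversely: here nu({1}) = 0 < mu({1}).
converse_fails = nu[1] == 0 and mu[1] > 0
```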


Observe that in the situation of Theorem 4.4.4,

$$\mu(C) = 0 \implies \nu(C) = 0 \qquad (C \in \mathcal{X}). \tag{4.14}$$


Definition 4.4.5 Let $\mu$ and $\nu$ be two measures on $(X, \mathcal{X})$. If (4.14) holds, $\nu$ is said to be absolutely continuous with respect to $\mu$. This is denoted by $\nu \ll \mu$.


The proof of the next theorem is omitted.⁵


Theorem 4.4.6 Let $\mu$ and $\nu$ be two $\sigma$-finite measures on $(X, \mathcal{X})$ such that $\nu \ll \mu$. Then there exists a non-negative function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$$\nu(dx) = h(x)\, \mu(dx).$$

The function $h$ is called the Radon–Nikodym derivative of $\nu$ with respect to $\mu$ and is denoted $d\nu/d\mu$. With such a notation, we have that

$$\int_X f(x)\, \nu(dx) = \int_X f(x)\, \frac{d\nu}{d\mu}(x)\, \mu(dx)$$

for all non-negative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.


5 See Subsection 2.3.1 of [5].

The Fubini–Tonelli Theorem


Let $(X_1, \mathcal{X}_1, \mu_1)$ and $(X_2, \mathcal{X}_2, \mu_2)$ be two measure spaces where $\mu_1$ and $\mu_2$ are sigma-finite. Define the product set $X = X_1 \times X_2$ and the product $\sigma$-field $\mathcal{X} = \mathcal{X}_1 \otimes \mathcal{X}_2$, where by definition the latter is the smallest $\sigma$-field on $X$ containing all sets of the form $A_1 \times A_2$, where $A_1 \in \mathcal{X}_1$, $A_2 \in \mathcal{X}_2$.


Theorem 4.4.7 There exists a unique measure $\mu$ on $(X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2)$ such that

$$\mu(A_1 \times A_2) = \mu_1(A_1)\, \mu_2(A_2) \tag{4.15}$$

for all $A_1 \in \mathcal{X}_1$, $A_2 \in \mathcal{X}_2$.

The proofs of this theorem and of the next one are omitted.⁶
The measure $\mu$ is the product measure of $\mu_1$ and $\mu_2$, and is denoted $\mu_1 \times \mu_2$.


The above result and the following ones are stated for products of two sigma- 
finite measures, but extend in an obvious manner to a finite number of sigma-finite 
measures. 


The typical example of a product measure is the Lebesgue measure on the space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$: it is the unique measure $\ell^n$ on that space such that $\ell^n\big(\prod_{i=1}^{n} A_i\big) = \prod_{i=1}^{n} \ell(A_i)$ for all $A_1, \ldots, A_n \in \mathcal{B}(\mathbb{R})$.


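Theorem 4.4.8 Let $(X_1, \mathcal{X}_1, \mu_1)$ and $(X_2, \mathcal{X}_2, \mu_2)$ be two measure spaces in which $\mu_1$ and $\mu_2$ are sigma-finite. Let $(X, \mathcal{X}, \mu) = (X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2, \mu_1 \times \mu_2)$.


(A) Tonelli. If $f$ is non-negative, then, for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is measurable with respect to $\mathcal{X}_2$, and

$$x_1 \mapsto \int_{X_2} f(x_1, x_2)\, \mu_2(dx_2)$$

is a measurable function with respect to $\mathcal{X}_1$. Furthermore,

$$\int_X f\, d\mu = \int_{X_1} \Big[ \int_{X_2} f(x_1, x_2)\, \mu_2(dx_2) \Big] \mu_1(dx_1). \tag{4.17}$$


(B) Fubini. If $f$ is $\mu$-integrable, then, for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is $\mu_2$-integrable and $x_1 \mapsto \int_{X_2} f(x_1, x_2)\, \mu_2(dx_2)$ is $\mu_1$-integrable, and (4.17) is true.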


6 See Subsection 2.3.2 of [5].

We shall refer to the global result as the Fubini–Tonelli Theorem. Part (A) says that one can integrate a non-negative measurable function in any order of its variables. Part (B) says that the same is true of an arbitrary measurable function if that function is $\mu$-integrable. In general, in order to apply Part (B) one must use Part (A) in order to ascertain whether or not $f$ is $\mu$-integrable. The next example should convince the reader of the necessity of checking this integrability condition.


EXAMPLE 4.4.9: WHEN FUBINI IS NOT APPLICABLE. Consider the function $f$ defined on $X_1 \times X_2 = (1, \infty) \times (0, 1)$ by the formula

$$f(x_1, x_2) = e^{-x_1 x_2} - 2 e^{-2 x_1 x_2}.$$

We have

$$\int_{(1, \infty)} f(x_1, x_2)\, dx_1 = \frac{e^{-x_2} - e^{-2 x_2}}{x_2} =: h(x_2) > 0,$$

$$\int_{(0, 1)} f(x_1, x_2)\, dx_2 = -\frac{e^{-x_1} - e^{-2 x_1}}{x_1} = -h(x_1).$$

However,

$$\int_{(0, 1)} h(x_2)\, dx_2 \ne \int_{(1, \infty)} (-h(x_1))\, dx_1,$$

since $h > 0$ $\ell$-a.e. on $(0, \infty)$. We therefore see that successive integrations yield different results according to the order in which they are performed. As a matter of fact, $f$ is not integrable on $(1, \infty) \times (0, 1)$.
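The discrepancy between the two orders of integration can be observed numerically. The sketch below uses the closed forms of the inner integrals, with $h(x) = (e^{-x} - e^{-2x})/x$, and a plain midpoint rule; the helper `midpoint` and the truncation of $(1, \infty)$ at 50 are assumptions made for the illustration.

```python
import math

def h(x):
    """h(x) = (exp(-x) - exp(-2x)) / x, which is positive for x > 0."""
    return (math.exp(-x) - math.exp(-2.0 * x)) / x

def midpoint(func, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(func(lo + (i + 0.5) * step) for i in range(n))

# Integrate in x1 first (inner integral h(x2)), then in x2 over (0, 1).
first_x1_then_x2 = midpoint(h, 0.0, 1.0)
# Integrate in x2 first (inner integral -h(x1)), then in x1 over (1, infinity),
# truncated at 50 where the integrand is negligible.
first_x2_then_x1 = midpoint(lambda x: -h(x), 1.0, 50.0)
```

One iterated integral is positive and the other negative, so no choice of numerical tolerance can reconcile them.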


The Formula of Integration by Parts 


Theorem 4.4.10 Let $\mu_1$ and $\mu_2$ be two $\sigma$-finite measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For any interval $(a, b] \subseteq \mathbb{R}$,

$$\mu_1((a, b])\, \mu_2((a, b]) = \int_{(a, b]} \mu_1((a, t])\, \mu_2(dt) + \int_{(a, b]} \mu_2((a, t))\, \mu_1(dt). \tag{4.18}$$

Observe that the first integral features the interval $(a, t]$ (closed on the right), whereas in the second integral, the interval is of the type $(a, t)$ (open on the right).


In terms of Lebesgue–Stieltjes integrals,

$$F_1(b) F_2(b) - F_1(a) F_2(a) = \int_{(a, b]} F_1(x)\, dF_2(x) + \int_{(a, b]} F_2(x-)\, dF_1(x),$$

where $F_1$ and $F_2$ are CDFs on $\mathbb{R}$. This is the Lebesgue–Stieltjes version of the integration by parts formula of calculus.

Proof. The proof consists in computing the $\mu_1 \times \mu_2$-measure of the square $D := (a, b] \times (a, b]$ in two ways. The first one is obvious and gives the left-hand side of (4.18). The second one consists in observing that $\mu(D) = \mu(D_1) + \mu(D_2)$, where $D_1 = \{(x, y) ;\, a < y \le b,\ a < x \le y\}$ and $D_2 = \{(a, b] \times (a, b]\} \setminus D_1$. Then $\mu(D_1)$ and $\mu(D_2)$ are computed using Tonelli's theorem. For instance,

$$\mu(D_1) = \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} 1_{D_1}(x, y)\, \mu_1(dx) \Big) \mu_2(dy) = \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} 1_{\{a < x \le y\}}\, \mu_1(dx) \Big) 1_{\{a < y \le b\}}\, \mu_2(dy) = \int_{(a, b]} \mu_1((a, y])\, \mu_2(dy).$$
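The role of the closed and open intervals in (4.18) shows up as soon as the two measures share an atom. The sketch below assumes two hypothetical atomic measures `mu1` and `mu2` with a common atom at 1 and at 2; the helper `mass` is illustrative.

```python
from fractions import Fraction

# Hypothetical atomic measures on R (point masses); they share the atoms 1 and 2.
mu1 = {0: Fraction(1), 1: Fraction(2), 2: Fraction(1, 2)}
mu2 = {1: Fraction(3), Fraction(3, 2): Fraction(1), 2: Fraction(4)}

def mass(mu, lo, hi, closed_right):
    """mu((lo, hi]) if closed_right, else mu((lo, hi))."""
    return sum(m for x, m in mu.items()
               if lo < x and (x <= hi if closed_right else x < hi))

a, b = Fraction(1, 2), 2
lhs = mass(mu1, a, b, True) * mass(mu2, a, b, True)
# First integral of (4.18): mu1((a, t]) integrated against mu2(dt) over (a, b].
int1 = sum(mass(mu1, a, t, True) * m for t, m in mu2.items() if a < t <= b)
# Second integral of (4.18): mu2((a, t)) integrated against mu1(dt) over (a, b].
int2 = sum(mass(mu2, a, t, False) * m for t, m in mu1.items() if a < t <= b)
# Using the closed interval (a, t] in the second integral counts the diagonal twice.
int2_closed = sum(mass(mu2, a, t, True) * m for t, m in mu1.items() if a < t <= b)
```

With both intervals closed the common atoms on the diagonal of the square are counted twice, and the identity fails.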


$L^p$-spaces and the Riesz–Fischer Theorem


For a given integer $p \ge 1$, $L^p_{\mathbb{C}}(\mu)$ is, roughly speaking (see the details below), the collection of complex-valued measurable functions $f$ defined on $X$ such that $\int_X |f|^p\, d\mu < \infty$. We shall see that it is a complete normed vector space over $\mathbb{C}$, that is, a Banach space.


Let $(X, \mathcal{X}, \mu)$ be a measure space and let $f, g$ be two complex-valued measurable functions defined on $X$. The relation $\mathcal{R}$ defined by

$$f \mathcal{R} g \text{ if and only if } f = g \ \mu\text{-a.e.}$$

is an equivalence relation. Denote the equivalence class of $f$ by $\{f\}$. Note that for any $p > 0$ (using property (b) of Theorem 4.3.2),

$$f \mathcal{R} g \implies \int_X |f|^p\, d\mu = \int_X |g|^p\, d\mu.$$


The operations $+$, $\times$, $*$ and multiplication by a scalar $\alpha \in \mathbb{C}$ are defined on the equivalence classes by

$$\{f\} + \{g\} = \{f + g\}, \quad \{f\}\{g\} = \{fg\}, \quad \{f\}^* = \{f^*\}, \quad \alpha\{f\} = \{\alpha f\}.$$

The first equality means that $\{f\} + \{g\}$ is, by definition, the equivalence class consisting of the functions $f + g$, where $f$ and $g$ are members of $\{f\}$ and $\{g\}$, respectively. Similar interpretations hold for the other equalities.


By definition, for a given $p \ge 1$, $L^p_{\mathbb{C}}(\mu)$ is the collection of equivalence classes $\{f\}$ such that $\int_X |f|^p\, d\mu < \infty$. Clearly, it is a vector space over $\mathbb{C}$ (for the proof recall that

$$\Big( \frac{|f| + |g|}{2} \Big)^p \le \frac{1}{2} |f|^p + \frac{1}{2} |g|^p$$

since $t \mapsto t^p$ is a convex function when $p \ge 1$). In order to avoid cumbersome notation, in this section and in general whenever we consider $L^p$-spaces, we shall write $f$ for $\{f\}$. This abuse of notation is harmless since two members of the same equivalence class have the same integral if that integral is defined. Therefore, using this loose notation, we may write

$$L^p_{\mathbb{C}}(\mu) = \Big\{ f : \int_X |f|^p\, d\mu < \infty \Big\}. \tag{4.19}$$


When the measure $\mu$ is the counting measure on the set $\mathbb{Z}$ of integers, the traditional notation is $\ell^p_{\mathbb{C}}(\mathbb{Z})$. This is the space of complex sequences $\{x_n\}_{n \in \mathbb{Z}}$ such that

$$\sum_{n \in \mathbb{Z}} |x_n|^p < \infty.$$


The following is a simple and often used observation. 


Theorem 4.4.11 Let $p$ and $q$ be positive real numbers such that $p > q$. If the measure $\mu$ on $(X, \mathcal{X})$ is finite, then $L^p_{\mathbb{C}}(\mu) \subseteq L^q_{\mathbb{C}}(\mu)$. In particular, $L^2_{\mathbb{C}}(\mu) \subseteq L^1_{\mathbb{C}}(\mu)$.


Proof. From the inequality $|a|^q \le 1 + |a|^p$, true for all $a \in \mathbb{C}$, it follows that $\mu(|f|^q) \le \mu(1) + \mu(|f|^p)$. Since $\mu(1) = \mu(X) < \infty$, $\mu(|f|^q) < \infty$ whenever $\mu(|f|^p) < \infty$.


This inclusion is not true in general if $\mu$ is not a finite measure. For instance, consider the Lebesgue measure $\ell$ on $\mathbb{R}$: there exist functions in $L^1_{\mathbb{C}}(\ell)$ that are not in $L^2_{\mathbb{C}}(\ell)$ and vice versa.


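In the case of the counting measure on $\mathbb{Z}$, the order of inclusion is the reverse of the one concerning finite measures:


Theorem 4.4.12 ($\ell^p$ inclusions.) If $p > q$, $\ell^q_{\mathbb{C}}(\mathbb{Z}) \subseteq \ell^p_{\mathbb{C}}(\mathbb{Z})$. In particular, $\ell^1_{\mathbb{C}}(\mathbb{Z}) \subseteq \ell^2_{\mathbb{C}}(\mathbb{Z})$.


Proof. Exercise 4.5.19.

Theorem 4.4.13 (Hölder's inequality.) Let $p$ and $q$ be real numbers in $(1, \infty)$ such that

$$\frac{1}{p} + \frac{1}{q} = 1$$

($p$ and $q$ are then said to be conjugate) and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be non-negative real functions. Then,

$$\int_X f g\, d\mu \le \Big[ \int_X f^p\, d\mu \Big]^{1/p} \Big[ \int_X g^q\, d\mu \Big]^{1/q}. \tag{4.20}$$

In particular, if $f, g \in L^2_{\mathbb{C}}(\mu)$, then $fg \in L^1_{\mathbb{C}}(\mu)$.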


Proof. Let

$$A := \Big( \int_X f^p\, d\mu \Big)^{1/p}, \qquad B := \Big( \int_X g^q\, d\mu \Big)^{1/q}.$$

It may be assumed that $0 < A, B < \infty$, because otherwise Hölder's inequality is trivially satisfied. Let $F := f/A$, $G := g/B$, so that

$$\int_X F^p\, d\mu = \int_X G^q\, d\mu = 1.$$

Suppose that we have been able to prove that

$$F(x) G(x) \le \frac{1}{p} F(x)^p + \frac{1}{q} G(x)^q. \tag{4.21}$$

Integrating this inequality yields

$$\int_X F G\, d\mu \le \frac{1}{p} + \frac{1}{q} = 1,$$

and this is just (4.20).


Inequality (4.21) is trivially satisfied if $x$ is such that $F(x) = 0$ or $G(x) = 0$. It is also satisfied when $F(x) > 0$ and $G(x) > 0$. Indeed, letting

$$s(x) := p \ln(F(x)), \qquad t(x) := q \ln(G(x)),$$

from the convexity of the exponential function and the assumption that $1/p + 1/q = 1$,

$$e^{s(x)/p + t(x)/q} \le \frac{1}{p} e^{s(x)} + \frac{1}{q} e^{t(x)},$$

and this is precisely inequality (4.21).


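For the last assertion of the theorem, take $p = q = 2$.

Theorem 4.4.14 (Minkowski's inequality.) Let $p \ge 1$ and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be non-negative functions in $L^p_{\mathbb{C}}(\mu)$. Then,

$$\Big[ \int_X (f + g)^p\, d\mu \Big]^{1/p} \le \Big[ \int_X f^p\, d\mu \Big]^{1/p} + \Big[ \int_X g^p\, d\mu \Big]^{1/p}. \tag{4.22}$$


Proof. For $p = 1$ the inequality (in fact an equality) is obvious. Therefore, assume $p > 1$. From Hölder's inequality

$$\int_X f (f + g)^{p-1}\, d\mu \le \Big[ \int_X f^p\, d\mu \Big]^{1/p} \Big[ \int_X (f + g)^{(p-1)q}\, d\mu \Big]^{1/q}$$

and

$$\int_X g (f + g)^{p-1}\, d\mu \le \Big[ \int_X g^p\, d\mu \Big]^{1/p} \Big[ \int_X (f + g)^{(p-1)q}\, d\mu \Big]^{1/q}.$$

Adding up the above two inequalities and observing that $(p - 1)q = p$, we obtain

$$\int_X (f + g)^p\, d\mu \le \Big( \Big[ \int_X f^p\, d\mu \Big]^{1/p} + \Big[ \int_X g^p\, d\mu \Big]^{1/p} \Big) \Big[ \int_X (f + g)^p\, d\mu \Big]^{1/q}.$$

One may assume that the right-hand side of (4.22) is finite and that the left-hand side is positive (otherwise the inequality is trivial). Therefore $\int_X (f + g)^p\, d\mu \in (0, \infty)$ (finiteness follows from $(f + g)^p \le 2^{p-1}(f^p + g^p)$) and we may therefore divide both sides of the last display by $\big[ \int_X (f + g)^p\, d\mu \big]^{1/q}$. Observing that $1 - 1/q = 1/p$ yields the announced inequality (4.22).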
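Both inequalities can be checked numerically on a finite measure space. The sketch below assumes a hypothetical weighted counting measure on four points and two non-negative functions given as lists; the helpers `integral` and `norm` are illustrative, and floating-point arithmetic is used since the exponents are not integers.

```python
def integral(values, weights):
    """Integral of a function on {0, ..., n-1} w.r.t. a weighted counting measure."""
    return sum(v * w for v, w in zip(values, weights))

def norm(values, weights, p):
    """(integral of |f|^p dmu)^(1/p)."""
    return integral([abs(v) ** p for v in values], weights) ** (1.0 / p)

weights = [0.5, 2.0, 1.0, 0.25]     # hypothetical measure mu (point masses)
f = [1.0, 0.0, 3.0, 2.0]            # non-negative functions
g = [2.0, 1.5, 0.5, 4.0]
p = 3.0
q = p / (p - 1.0)                   # conjugate exponent: 1/p + 1/q = 1

# Hölder: integral of f g <= ||f||_p ||g||_q.
holder_lhs = integral([x * y for x, y in zip(f, g)], weights)
holder_rhs = norm(f, weights, p) * norm(g, weights, q)
# Minkowski: ||f + g||_p <= ||f||_p + ||g||_p.
minkowski_lhs = norm([x + y for x, y in zip(f, g)], weights, p)
minkowski_rhs = norm(f, weights, p) + norm(g, weights, p)
```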


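Theorem 4.4.15 Let $p \ge 1$. The mapping $\nu_p : L^p_{\mathbb{C}}(\mu) \to [0, \infty)$ defined by

$$\nu_p(f) := \Big( \int_X |f|^p\, d\mu \Big)^{1/p} \tag{4.23}$$

is a norm on $L^p_{\mathbb{C}}(\mu)$.


Proof. Clearly, $\nu_p(\alpha f) = |\alpha|\, \nu_p(f)$ for all $\alpha \in \mathbb{C}$, $f \in L^p_{\mathbb{C}}(\mu)$. Also, $\nu_p(f) = 0$ if and only if $\big( \int_X |f|^p\, d\mu \big)^{1/p} = 0$, which in turn is equivalent to $f = 0$, $\mu$-a.e. Finally, $\nu_p(f + g) \le \nu_p(f) + \nu_p(g)$ for all $f, g \in L^p_{\mathbb{C}}(\mu)$, by Minkowski's inequality.


Denoting $\nu_p(f)$ by $\|f\|_p$, $L^p_{\mathbb{C}}(\mu)$ is a normed vector space over $\mathbb{C}$, with the norm $\|\cdot\|_p$ and the induced metric $d_p(f, g) := \|f - g\|_p$.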


Theorem 4.4.16 Let $p \ge 1$. The metric $d_p$ makes of $L^p_{\mathbb{C}}(\mu)$ a complete normed
vector space.


In other words, $L^p_{\mathbb{C}}(\mu)$ is a Banach space for the norm $\|\cdot\|_p$.


Proof. To show completeness one must prove that for any sequence $\{f_n\}_{n \ge 1}$ of
$L^p_{\mathbb{C}}(\mu)$ that is a Cauchy sequence (that is, such that $\lim_{m,n \uparrow \infty} d_p(f_n, f_m) = 0$), there
exists an $f \in L^p_{\mathbb{C}}(\mu)$ such that $\lim_{n \uparrow \infty} d_p(f_n, f) = 0$.
Since $\{f_n\}_{n \ge 1}$ is a Cauchy sequence, one can select a subsequence $\{f_{n_i}\}_{i \ge 1}$ such
that

\[ d_p(f_{n_{i+1}}, f_{n_i}) \le 2^{-i}. \tag{4.24} \]

Let

\[ g_k := \sum_{i=1}^{k} |f_{n_{i+1}} - f_{n_i}|, \qquad g := \sum_{i=1}^{\infty} |f_{n_{i+1}} - f_{n_i}|. \]


By (4.24) and Minkowski's inequality, $\|g_k\|_p \le 1$. Fatou's lemma applied to the
sequence $\{g_k^p\}_{k \ge 1}$ gives $\|g\|_p \le 1$. In particular, any member of the equivalence
class of $g$ is $\mu$-almost everywhere finite and therefore

\[ f_{n_1}(x) + \sum_{i=1}^{\infty} \left(f_{n_{i+1}}(x) - f_{n_i}(x)\right) \]

converges absolutely for $\mu$-almost all $x$. Call the corresponding limit $f(x)$ (set
$f(x) = 0$ when this limit does not exist). Since

\[ f_{n_1} + \sum_{i=1}^{k-1} \left(f_{n_{i+1}} - f_{n_i}\right) = f_{n_k}, \]

we see that

\[ f = \lim_{k \uparrow \infty} f_{n_k} \quad \mu\text{-a.e.} \]

One must show that $f$ is the limit in $L^p_{\mathbb{C}}(\mu)$ of $\{f_n\}_{n \ge 1}$. Let $\varepsilon > 0$. There exists an
integer $N = N(\varepsilon)$ such that $\|f_n - f_m\|_p \le \varepsilon$ whenever $m, n \ge N$. For all $m > N$,
by Fatou's lemma we have

\[ \int_X |f - f_m|^p \, d\mu \le \liminf_{i \to \infty} \int_X |f_{n_i} - f_m|^p \, d\mu \le \varepsilon^p. \]

Therefore $f - f_m \in L^p_{\mathbb{C}}(\mu)$ and consequently $f \in L^p_{\mathbb{C}}(\mu)$. It also follows from the
last inequality that

\[ \lim_{m \to \infty} \|f - f_m\|_p = 0. \]


The next result is a by-product of the proof of Theorem 4.4.16.

Theorem 4.4.17 Let $p \ge 1$ and let $\{f_n\}_{n \ge 1}$ be a convergent sequence in $L^p_{\mathbb{C}}(\mu)$.
Let $f$ be the corresponding limit in $L^p_{\mathbb{C}}(\mu)$. Then, there exists a subsequence
$\{f_{n_i}\}_{i \ge 1}$ such that

\[ \lim_{i \uparrow \infty} f_{n_i} = f \quad \mu\text{-a.e.} \tag{4.25} \]
Note that the statement in (4.25) is about functions and not about equivalence
classes. The functions thereof are any members of the corresponding equivalence
class. In particular, when a given sequence of functions converges $\mu$-a.e. to two
functions, these two functions are necessarily equal $\mu$-a.e. Therefore,


Theorem 4.4.18 If $\{f_n\}_{n \ge 1}$ converges both to $f$ in $L^p_{\mathbb{C}}(\mu)$ and to $g$ $\mu$-a.e., then
$f = g$ $\mu$-a.e.


Of special interest for applications is the space $L^2_{\mathbb{C}}(\mu)$ of complex measurable
functions $f : X \to \mathbb{C}$ such that

\[ \int_X |f(x)|^2 \, \mu(dx) < \infty, \]

where two functions $f$ and $f'$ such that $f(x) = f'(x)$, $\mu$-a.e. are not distinguished.
We have by the Riesz-Fischer theorem:


Theorem 4.4.19 $L^2_{\mathbb{C}}(\mu)$ is a vector space with scalar field $\mathbb{C}$, and when endowed
with the inner product

\[ \langle f, g \rangle := \int_X f(x) g(x)^* \, \mu(dx), \tag{4.26} \]

it is a Hilbert space.


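The norm of a function $f \in L^2_{\mathbb{C}}(\mu)$ is

\[ \|f\| = \left(\int_X |f(x)|^2 \, \mu(dx)\right)^{1/2} \]

and the distance between two functions $f$ and $g$ in $L^2_{\mathbb{C}}(\mu)$ is

\[ d(f, g) = \left(\int_X |f(x) - g(x)|^2 \, \mu(dx)\right)^{1/2}. \]

The completeness property of $L^2_{\mathbb{C}}(\mu)$ reads in this case as follows. If $\{f_n\}_{n \ge 1}$ is a
sequence of functions in $L^2_{\mathbb{C}}(\mu)$ such that

\[ \lim_{m,n \uparrow \infty} \int_X |f_n(x) - f_m(x)|^2 \, \mu(dx) = 0, \]

then there exists a function $f \in L^2_{\mathbb{C}}(\mu)$ such that

\[ \lim_{n \uparrow \infty} \int_X |f_n(x) - f(x)|^2 \, \mu(dx) = 0. \]

In $L^2_{\mathbb{C}}(\mu)$, Schwarz's inequality reads as follows:

\[ \left|\int_X f(x) g(x)^* \, \mu(dx)\right| \le \left(\int_X |f(x)|^2 \, \mu(dx)\right)^{1/2} \left(\int_X |g(x)|^2 \, \mu(dx)\right)^{1/2}. \]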
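EXAMPLE 4.4.20: COMPLEX SEQUENCES. The set of complex sequences $a = \{a_n\}_{n \in \mathbb{Z}}$ such that

\[ \sum_{n \in \mathbb{Z}} |a_n|^2 < \infty \]

is, when endowed with the inner product

\[ \langle a, b \rangle := \sum_{n \in \mathbb{Z}} a_n b_n^*, \]

a Hilbert space, denoted by $\ell^2_{\mathbb{C}}(\mathbb{Z})$. This is indeed a particular case of a Hilbert
space $L^2_{\mathbb{C}}(\mu)$, where $X = \mathbb{Z}$ and $\mu$ is the counting measure. In this example,
Schwarz's inequality takes the form

\[ \left|\sum_{n \in \mathbb{Z}} a_n b_n^*\right| \le \left(\sum_{n \in \mathbb{Z}} |a_n|^2\right)^{1/2} \times \left(\sum_{n \in \mathbb{Z}} |b_n|^2\right)^{1/2}. \]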


4.5 Exercises 


Exercise 4.5.1. SET INVERSE FUNCTION
Let $U$ and $E$ be arbitrary sets and let $f$ be some function from $U$ to $E$. For any
subset $A \subseteq E$, let

\[ f^{-1}(A) = \{u \in U;\ f(u) \in A\}. \]

(i) Show that for all $u \in U$, $1_A(f(u)) = 1_{f^{-1}(A)}(u)$.
(ii) Prove that if $\mathcal{E}$ is a σ-field on $E$, then the collection of subsets $f^{-1}(\mathcal{E}) :=
\{f^{-1}(A);\ A \in \mathcal{E}\}$ is a σ-field on $U$.


Exercise 4.5.2. NO TITLE
Let $f$ be a function from $\mathbb{R}$ to $\mathbb{R}$. Prove that for any $a \in \mathbb{R}$,

\[ \bigcap_{n \ge 1} \{x;\ f(x) \le a + 1/n\} = \{x;\ f(x) \le a\}. \]


Exercise 4.5.3. σ-FIELD GENERATED BY A COLLECTION OF SETS
Let $I$ be an arbitrary non-empty index set.


(1) Let $\{\mathcal{F}_i\}_{i \in I}$ be an arbitrary non-empty family of σ-fields on some set $\Omega$. Show
that the family $\mathcal{F} := \bigcap_{i \in I} \mathcal{F}_i$ ($A \in \mathcal{F}$ if and only if $A \in \mathcal{F}_i$ for all $i \in I$) is a σ-field.
(2) Let $\mathcal{C}$ be an arbitrary family of subsets of some set $\Omega$. Prove the existence of
a smallest σ-field $\mathcal{F}$ containing $\mathcal{C}$. (This means, by definition, that $\mathcal{F}$ is a σ-field
on $\Omega$ containing $\mathcal{C}$, such that if $\mathcal{F}'$ is a σ-field on $\Omega$ containing $\mathcal{C}$, then $\mathcal{F} \subseteq \mathcal{F}'$.)


Exercise 4.5.4. $\mathcal{B}(\mathbb{R}^n)$

Prove that $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}$ of all rectangles of the type
$\prod_{i=1}^{n} (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ ($i \in \{1, \ldots, n\}$). ($\mathbb{Q}$ is the set of rational numbers.)


Exercise 4.5.5. GROSS SIGMA-FIELD

Show that a function $f : X \to \mathbb{R}$ that is measurable with respect to the gross
sigma-field on $X$ and the Borel sigma-field on $\mathbb{R}$ is a constant (takes only one
value).


Exercise 4.5.6. |f| MEASURABLE, f NOT MEASURABLE

Let $(X, \mathcal{X})$ be a measurable space such that $\mathcal{X} \neq \mathcal{P}(X)$ (for instance if $(X, \mathcal{X}) =
(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, a fact that we shall admit here). Let $f : X \to \mathbb{R}$ be a function. Is it
true that if $|f|$ is measurable with respect to $\mathcal{X}$ and $\mathcal{B}(\mathbb{R})$, then so is $f$ itself?


Exercise 4.5.7. SEQUENTIAL CONTINUITY
In Theorem 4.1.21, show by means of a counterexample the necessity of the condition
$\mu(B_{n_0}) < \infty$ for some $n_0$.


Exercise 4.5.8. THE RATIONALS ARE LEBESGUE-NEGLIGIBLE 
Prove that any singleton {a} (a € R) is a Borel set of null Lebesgue measure and 
that the set of rationals Q is a Borel set of null Lebesgue measure. 


Exercise 4.5.9. INTEGRAL OF A SIMPLE FUNCTION 
Prove (4.5). 


Exercise 4.5.10. INTEGRAL WITH RESPECT TO THE WEIGHTED COUNTING
MEASURE


Any function $f : \mathbb{Z} \to \mathbb{R}$ is measurable with respect to $\mathcal{P}(\mathbb{Z})$ and $\mathcal{B}(\mathbb{R})$. With the
measure $\mu$ defined in Example 4.1.18, and with $f \ge 0$ for instance, show that

\[ \mu(f) = \sum_{n \in \mathbb{Z}} \alpha_n f(n) \]

by following exactly the steps of the general construction.
Exercise 4.5.11. FOURIER TRANSFORM
The Fourier transform of a function $f : \mathbb{R} \to \mathbb{R}$ that is integrable with respect to
Lebesgue measure is the function $\hat{f} : \mathbb{R} \to \mathbb{C}$ defined by:

\[ \hat{f}(\nu) := \int_{\mathbb{R}} f(t) e^{-2i\pi\nu t} \, dt. \]

This is denoted by $\mathcal{F} : f \to \hat{f}$.


Prove that $\hat{f}$ is bounded and uniformly continuous.


Exercise 4.5.12. CONVOLUTION OF INTEGRABLE FUNCTIONS
Let $h, f : \mathbb{R} \to \mathbb{R}$ be functions that are integrable with respect to Lebesgue
measure. Prove that the right-hand side of

\[ g(t) = \int_{\mathbb{R}} h(t - s) f(s) \, ds \]

is well defined almost everywhere (for the Lebesgue measure), and defines an
integrable function. (The function $g$ is the convolution of $h$ with $f$, and is denoted
by $g = h * f$.)


Exercise 4.5.13. THE FOURIER CONVOLUTION-MULTIPLICATION RULE.
(Continuation of Exercise 4.5.12) Prove that

\[ \mathcal{F} : h * f \to \hat{h} \hat{f}. \]


Exercise 4.5.14. IMAGE MEASURE


What is the measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that is the image of the Lebesgue measure $\ell$
on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by the map $x \mapsto |x|$?


Exercise 4.5.15. SCHEFFÉ'S LEMMA

Let $f$ and $f_n$ ($n \ge 1$) be $\mu$-integrable non-negative functions such that $\lim_{n \uparrow \infty} f_n =
f$ $\mu$-a.e. and $\lim_{n \uparrow \infty} \int_X f_n \, d\mu = \int_X f \, d\mu$. Show that $\lim_{n \uparrow \infty} \int_X |f_n - f| \, d\mu = 0$.
(Hint: $|a - b| = a + b - 2\inf(a, b)$.)


Exercise 4.5.16. FUBINI TILES 

Consider any bounded closed rectangle of $\mathbb{R}^2$. We say that it has Property (A) if at
least one of its sides “is an integer” (meaning: its length is an integer). Now you are 
given a bounded closed rectangle A that is the union of a finite number of disjoint 
closed rectangles with Property (A). Show that A itself must have Property (A). 
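Exercise 4.5.17. INTEGRALS AND SUMS
Prove that for all $a, b > 0$,

\[ \int_0^{\infty} \frac{t e^{-at}}{1 - e^{-bt}} \, dt = \sum_{n=0}^{\infty} \frac{1}{(a + nb)^2}. \]


Exercise 4.5.18. FUBINI AGAIN
Define $f : [0,1]^2 \to \mathbb{R}$ by

\[ f(x, y) = \frac{x^2 - y^2}{(x^2 + y^2)^2} \, 1_{\{(x,y) \neq (0,0)\}}. \]

Compute $\int_{[0,1]} \left(\int_{[0,1]} f(x,y) \, dx\right) dy$ and $\int_{[0,1]} \left(\int_{[0,1]} f(x,y) \, dy\right) dx$. Is $f$ Lebesgue
integrable on $[0,1]^2$?


Exercise 4.5.19. $\ell^p_{\mathbb{C}}(\mathbb{Z})$
Prove that $\ell^1_{\mathbb{C}}(\mathbb{Z}) \subset \ell^2_{\mathbb{C}}(\mathbb{Z})$.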


4.6 Solutions 


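SOLUTION (Exercise 4.5.1).

(i) $1_A(f(u)) = 1 \iff f(u) \in A \iff u \in f^{-1}(A) \iff 1_{f^{-1}(A)}(u) = 1$.

(ii) is a direct consequence of the definition of a σ-field and of the following set
identities: for any subsets $A, A_1, A_2, \ldots$ of $E$,

\[ f^{-1}(\overline{A}) = \overline{f^{-1}(A)}, \]
\[ f^{-1}\left(\bigcap_{n=1}^{\infty} A_n\right) = \bigcap_{n=1}^{\infty} f^{-1}(A_n), \]
\[ f^{-1}\left(\bigcup_{n=1}^{\infty} A_n\right) = \bigcup_{n=1}^{\infty} f^{-1}(A_n). \]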
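The three set identities used in this solution can be checked mechanically on a small finite example. The sketch below is only an illustration: the sets `U` and `E`, the map `f`, and the helper `preimage` are hypothetical choices, not objects from the text.

```python
def preimage(f, domain, subset):
    """f^{-1}(subset) = {u in domain : f(u) in subset}."""
    return {u for u in domain if f(u) in subset}

U = set(range(-5, 6))   # hypothetical finite domain
E = set(range(0, 26))   # hypothetical finite codomain
f = lambda u: u * u     # maps U into E

A1, A2 = {0, 1, 4}, {4, 9, 16}

# Complement: the preimage of E \ A is U \ f^{-1}(A)
assert preimage(f, U, E - A1) == U - preimage(f, U, A1)
# Intersections and unions commute with the preimage
assert preimage(f, U, A1 & A2) == preimage(f, U, A1) & preimage(f, U, A2)
assert preimage(f, U, A1 | A2) == preimage(f, U, A1) | preimage(f, U, A2)

print(sorted(preimage(f, U, A1)))  # [-2, -1, 0, 1, 2]
```

A finite check is of course not a proof; it only illustrates that the preimage operation, unlike the direct image, commutes with all Boolean set operations.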


SOLUTION (Exercise 4.5.2).
If $f(x) \le a + \frac{1}{n}$ for all $n \ge 1$, one cannot have $f(x) > a$, because if it were the case,
there would certainly exist a sufficiently large $n$ such that $f(x) > a + \frac{1}{n}$. Therefore

\[ \bigcap_{n \ge 1} \{x;\ f(x) \le a + 1/n\} \subseteq \{x;\ f(x) \le a\}. \]

If $f(x) \le a$, then obviously $f(x) \le a + 1/n$ for all $n \ge 1$. Therefore

\[ \{x;\ f(x) \le a\} \subseteq \bigcap_{n \ge 1} \{x;\ f(x) \le a + 1/n\}. \]


SOLUTION (Exercise 4.5.3). 
Obvious. 


SOLUTION (Exercise 4.5.4).

It suffices to show that $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}'$ of all rectangles
$\prod_{i=1}^{n} (a_i, b_i)$ with rational endpoints. Note that $\mathcal{C}'$ is a countable collection and
that all its elements are open sets for the Euclidean topology (the latter we denote
by $\mathcal{O}$). It follows that $\mathcal{C}' \subseteq \mathcal{O}$ and therefore $\sigma(\mathcal{C}') \subseteq \sigma(\mathcal{O}) = \mathcal{B}(\mathbb{R}^n)$.


It remains to show that $\mathcal{O} \subseteq \sigma(\mathcal{C}')$, since this implies that $\sigma(\mathcal{O}) \subseteq \sigma(\mathcal{C}')$. For
this it suffices to show that any set $O \in \mathcal{O}$ is a countable union of elements in $\mathcal{C}'$.
Take $x \in O$. By definition of the Euclidean topology, there exists a non-empty
open ball $B(x, r)$ centered at $x$ and contained in $O$. Now we can always choose a
rational rectangle $R_x \in \mathcal{C}'$ that contains $x$ and that is contained in $B(x, r)$. Clearly
$\bigcup_{x \in O} R_x = O$. Since the $R_x$ are chosen in a countable family of sets, the union
$\bigcup_{x \in O} R_x$ is in fact countable. As a countable union of sets in $\mathcal{C}'$ it is in $\sigma(\mathcal{C}')$.
Therefore $O \in \sigma(\mathcal{C}')$.


SOLUTION (Exercise 4.5.5).

Suppose it takes two distinct values $a$ and $b$. Then $\{f = a\} := \{x \in X;\ f(x) = a\}$
and $\{f = b\}$ are two distinct members of the gross sigma-field on $X$. One of them
must therefore be $X$ itself, say $\{f = a\} = X$. Therefore $f$ is the constant function
equal to $a$.


SOLUTION (Exercise 4.5.6).
No. Take $f = 1_A - 1_{\overline{A}}$ where $A$ is a non-measurable set. This function is clearly
non-measurable (for instance $\{f = 1\} = A \notin \mathcal{X}$), but $|f| = 1$ is measurable.


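SOLUTION (Exercise 4.5.7).
Let $\nu$ be the counting measure on $\mathbb{Z}$ and let $B_n := \{i \in \mathbb{Z} : |i| \ge n\}$ ($n \ge 1$). Then
$\nu(B_n) = +\infty$ for all $n \ge 1$, whereas

\[ \nu\left(\bigcap_{n=1}^{\infty} B_n\right) = \nu(\varnothing) = 0. \]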
SOLUTION (Exercise 4.5.8).

The Borel σ-field $\mathcal{B}(\mathbb{R})$ is generated by the intervals $I_a = (-\infty, a]$, $a \in \mathbb{R}$ (Theorem
4.1.4), and therefore $\{a\} = \bigcap_{n \ge 1} (I_a - I_{a-1/n})$ is also in $\mathcal{B}(\mathbb{R})$. Denoting by $\ell$ the
Lebesgue measure, $\ell(I_a - I_{a-1/n}) = 1/n$, and therefore $\ell(\{a\}) = \lim_{n \uparrow \infty} \ell(I_a -
I_{a-1/n}) = 0$. $\mathbb{Q}$ is a countable union of sets in $\mathcal{B}(\mathbb{R})$ (singletons) and is therefore in
$\mathcal{B}(\mathbb{R})$. It has Lebesgue measure 0 as a countable union of sets of Lebesgue measure 0.


SOLUTION (Exercise 4.5.9).
This follows from the following chain of equalities, where it is noted that if $A_i \cap B_j \neq
\varnothing$, then $a_i = b_j$:


SOLUTION (Exercise 4.5.10).
It suffices to consider the approximating sequence of simple functions

\[ f_n(k) = \sum_{j=-n}^{+n} f(j) 1_{\{j\}}(k), \]

whose integral is

\[ \mu(f_n) = \sum_{j=-n}^{+n} f(j) \mu(\{j\}) = \sum_{j=-n}^{+n} f(j) \alpha_j, \]

and to let $n$ tend to $\infty$. When $\alpha_n \equiv 1$, the integral reduces to the sum of a series:

\[ \mu(f) = \sum_{n \in \mathbb{Z}} f(n). \]

In this case, integrability means that the series is absolutely convergent.
SOLUTION (Exercise 4.5.11).
From the definition, we have that

\[ |\hat{f}(\nu)| \le \int_{\mathbb{R}} |f(t) e^{-2i\pi\nu t}| \, dt = \int_{\mathbb{R}} |f(t)| \, dt, \]

where the last term does not depend on $\nu$ and is finite. Also, for all $h \in \mathbb{R}$,

\[ |\hat{f}(\nu + h) - \hat{f}(\nu)| \le \int_{\mathbb{R}} |f(t)| \, |e^{-2i\pi(\nu+h)t} - e^{-2i\pi\nu t}| \, dt = \int_{\mathbb{R}} |f(t)| \, |e^{-2i\pi ht} - 1| \, dt. \]

The last term is independent of $\nu$ and tends to 0 as $h \to 0$ by dominated convergence.


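SOLUTION (Exercise 4.5.12).


By Tonelli's theorem and the integrability assumptions,

\[ \int_{\mathbb{R}} \int_{\mathbb{R}} |h(t-s)| \, |f(s)| \, dt \, ds = \left(\int_{\mathbb{R}} |h(t)| \, dt\right) \left(\int_{\mathbb{R}} |f(s)| \, ds\right) < \infty. \]

This implies that, for $\ell$-almost all $t$,

\[ \int_{\mathbb{R}} |h(t-s) f(s)| \, ds < \infty. \]

The integral $\int_{\mathbb{R}} h(t-s) f(s) \, ds$ is therefore well defined for $\ell$-almost all $t$. Also

\[ \int_{\mathbb{R}} |g(t)| \, dt = \int_{\mathbb{R}} \left|\int_{\mathbb{R}} h(t-s) f(s) \, ds\right| dt \le \int_{\mathbb{R}} \int_{\mathbb{R}} |h(t-s) f(s)| \, dt \, ds < \infty, \]

that is, $g$ is integrable.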


SOLUTION (Exercise 4.5.13).
We have

\[ \int_{\mathbb{R}} \left(\int_{\mathbb{R}} h(t-s) f(s) \, ds\right) e^{-2i\pi\nu t} \, dt
= \int_{\mathbb{R}} \int_{\mathbb{R}} h(t-s) e^{-2i\pi\nu(t-s)} f(s) e^{-2i\pi\nu s} \, ds \, dt \]
\[ = \int_{\mathbb{R}} f(s) e^{-2i\pi\nu s} \left(\int_{\mathbb{R}} h(t-s) e^{-2i\pi\nu(t-s)} \, dt\right) ds = \hat{h}(\nu) \hat{f}(\nu), \]

by Fubini's theorem, which is applicable here because the function

\[ (t, s) \mapsto |h(t-s) f(s) e^{-2i\pi\nu t}| = |h(t-s) f(s)| \]

is integrable with respect to the product measure $dt \times ds$ (Exercise 4.5.12).


SOLUTION (Exercise 4.5.14).
$2 \cdot 1_{\{x \ge 0\}} \, \ell(dx)$.


SOLUTION (Exercise 4.5.15).

The function $\inf(f_n, f)$ is bounded by the ($\mu$-integrable) function $f$ (this is where
the non-negativeness assumption is used). Moreover, it converges to $f$. Therefore,
by dominated convergence, $\lim_{n \uparrow \infty} \int_X \inf(f_n, f) \, d\mu = \int_X f \, d\mu$. The rest of the
proof follows from

\[ \int_X |f_n - f| \, d\mu = \int_X f_n \, d\mu + \int_X f \, d\mu - 2 \int_X \inf(f_n, f) \, d\mu. \]


SOLUTION (Exercise 4.5.16).

Let $I$ be a finite interval of $\mathbb{R}$. Observe that $\int_I e^{2i\pi x} \, dx = 0$ if and only if the length
of $I$ is an integer. Let now $I \times J$ be a finite rectangle. It has Property (A) if and
only if $\iint_{I \times J} e^{2i\pi(x+y)} \, dx \, dy = \int_I e^{2i\pi x} \, dx \times \int_J e^{2i\pi y} \, dy = 0$. (This is where we use
Fubini.) Now

\[ \iint_A e^{2i\pi(x+y)} \, dx \, dy = \iint_{\bigcup_{n=1}^{K} A_n} e^{2i\pi(x+y)} \, dx \, dy = \sum_{n=1}^{K} \iint_{A_n} e^{2i\pi(x+y)} \, dx \, dy = 0, \]

since the $A_n$'s form a partition of $A$ and all have Property (A).


SOLUTION (Exercise 4.5.17).

\[ \int_0^{\infty} \frac{t e^{-at}}{1 - e^{-bt}} \, dt = \int_0^{\infty} \left(\sum_{n=0}^{+\infty} t e^{-(a+nb)t}\right) dt = \sum_{n=0}^{+\infty} \int_0^{\infty} t e^{-(a+nb)t} \, dt = \sum_{n=0}^{+\infty} \frac{1}{(a+nb)^2}, \]

where the second equality is justified by Tonelli's theorem applied to the product
of the Lebesgue measure by the counting measure.
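The identity of Exercise 4.5.17 can be checked numerically. The sketch below compares a midpoint-rule approximation of the integral with a truncated series for the illustrative choice $a = b = 1$, where both sides equal $\pi^2/6$. The helper names, the truncation point `upper`, and the step counts are assumptions made for this illustration.

```python
import math

def integrand(t, a, b):
    """t * exp(-a t) / (1 - exp(-b t)) for t > 0."""
    # -expm1(-b t) equals 1 - exp(-b t) and stays accurate for small t.
    return t * math.exp(-a * t) / (-math.expm1(-b * t))

def midpoint_integral(a, b, upper=60.0, steps=200_000):
    """Midpoint rule on (0, upper); the tail beyond `upper` is negligible for a >= 1."""
    h = upper / steps
    return h * sum(integrand((k + 0.5) * h, a, b) for k in range(steps))

def partial_sum(a, b, terms=200_000):
    """Truncated series: sum over n < terms of 1 / (a + n b)^2."""
    return sum(1.0 / (a + n * b) ** 2 for n in range(terms))

a, b = 1.0, 1.0
print(midpoint_integral(a, b), partial_sum(a, b), math.pi ** 2 / 6)
```

The three printed numbers should agree to several decimal places; the truncated series converges slowly, with a tail of order 1/`terms`, so it is the least accurate of the three.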


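SOLUTION (Exercise 4.5.18).
If $y \neq 0$, the function $x \mapsto f(x, y)$ is continuous on $[0,1]$ and therefore Lebesgue
integrable, being bounded. We have

\[ \int_{[0,1]} f(x, y) \, dx = \left[-\frac{x}{x^2 + y^2}\right]_0^1 = -\frac{1}{1 + y^2}. \]

For $y = 0$, $\int_{[0,1]} f(x, 0) \, dx = \int_{[0,1]} \frac{1}{x^2} \, dx = +\infty$. Therefore,

\[ \int_{[0,1]} f(x, y) \, dx = -\frac{1}{1 + y^2}, \quad \ell\text{-a.e.}, \]

and

\[ \int_{[0,1]} \left(\int_{[0,1]} f(x, y) \, dx\right) dy = -\int_0^1 \frac{dy}{1 + y^2} = -\frac{\pi}{4}. \]

Observing that $f(x, y) = -f(y, x)$ we obtain

\[ \int_{[0,1]} \left(\int_{[0,1]} f(x, y) \, dy\right) dx = \frac{\pi}{4}. \]

$f$ cannot be integrable on $[0,1]^2$, otherwise the two integrals would be the same,
by Fubini's theorem.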


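SOLUTION (Exercise 4.5.19).

If $\sum_{n \in \mathbb{Z}} |a_n| < \infty$, there exists $n_0$ such that if $|n| \ge n_0$, $|a_n| \le 1$. In particular, for
$|n| \ge n_0$, $|a_n|^2 \le |a_n|$. Therefore $\sum_{|n| \ge n_0} |a_n|^2 \le \sum_{|n| \ge n_0} |a_n| < \infty$, from which it
follows that $\sum_{n \in \mathbb{Z}} |a_n|^2 < \infty$.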




Chapter 5 


From Integral to Expectation 


Probability theory is, from a formal point of view, a particular chapter of measure
and integration theory. Since the terminologies of the two theories are different, 
we shall first proceed to the “translation” of the theory of measure and integration 
into the theory of probability and expectation. 


5.1 Translation 


Recall the probabilistic trinity, the triple $(\Omega, \mathcal{F}, P)$, where $P$ (the probability) is a
measure on the measurable space $(\Omega, \mathcal{F})$ with total mass $P(\Omega) = 1$.


Most of the results of the present section follow from those of the previous
chapter by a mere change of notation: $X \to \Omega$, $\mathcal{X} \to \mathcal{F}$, $\mu \to P$ and $f \to X$, so
that for instance

\[ \int_X f(x) \, \mu(dx) \to \int_\Omega X(\omega) \, P(d\omega). \]

(Of course, the reader is aware that the "X's" on both sides are of a different
nature. But this notational collision will not happen any more in the sequel.)


Definition 5.1.1 A measurable function $X$ from $(\Omega, \mathcal{F})$ to a measurable space
$(E, \mathcal{E})$ is called a random element with values in $(E, \mathcal{E})$ (or in $E$, for short, when
the context is unambiguous).


When $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ or $(\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$, $X$ is also called a random variable
(r.v.) (real r.v. if $E = \mathbb{R}$, extended r.v. if $E = \overline{\mathbb{R}}$). If $(E, \mathcal{E}) = (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, $X$
is called a random vector (of dimension $n$), and then $X = (X_1, \ldots, X_n)$ where the
$X_i$ are random variables. A complex random variable is a function $X : \Omega \to \mathbb{C}$ of
the form $X = X_R + iX_I$ where $X_R$ and $X_I$ are real random variables.
If $X$ is a random element with values in $(E, \mathcal{E})$ and if $g$ is a measurable function
from $(E, \mathcal{E})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, then $g(X)$ is, by the composition theorem for measurable
functions (Theorem 4.1.11), a random variable.


Since a random variable $X$ is a measurable function, we can define, under
rather general circumstances, its integral with respect to the probability measure
$P$, called the expectation of $X$. Therefore

\[ E[X] := \int_\Omega X(\omega) \, P(d\omega). \]

Recall the construction of the integral given in Section 4.2 in the special case of a
probability. First, if $A \in \mathcal{F}$,

\[ E[1_A] := P(A) \]

and, more generally, if $X$ is a simple random variable, that is, $X(\omega) = \sum_{i=1}^{N} a_i 1_{A_i}(\omega)$
where $N \in \mathbb{N}_+$, $a_i \in \mathbb{R}$ and $A_i \in \mathcal{F}$ ($1 \le i \le N$), then

\[ E[X] := \sum_{i=1}^{N} a_i P(A_i). \]

For a non-negative random variable $X$, the expectation is defined by

\[ E[X] := \lim_{n \uparrow \infty} E[X_n], \]

where $\{X_n\}_{n \ge 1}$ is any non-decreasing sequence of non-negative simple random
variables that converges to $X$.


This definition is consistent, that is, it does not depend on the approximating
non-decreasing sequence of non-negative simple random variables admitting $X$
for limit. When $X$ is of arbitrary sign, the expectation is defined by $E[X] :=
E[X^+] - E[X^-]$ if $E[X^+]$ and $E[X^-]$ are not both infinite. If $E[X^+]$ and $E[X^-]$ are
both infinite, the expectation is not defined. If $E[|X|] < \infty$, $X$ is said to be integrable,
and then $E[X]$ is a finite number.


The basic properties of expectation follow from the general case of the integral
with respect to an arbitrary measure. These are linearity and monotonicity: If $X_1$
and $X_2$ are random variables with expectations, then for all $\lambda_1, \lambda_2 \in \mathbb{R}$,

\[ E[\lambda_1 X_1 + \lambda_2 X_2] = \lambda_1 E[X_1] + \lambda_2 E[X_2], \]

whenever the right-hand side has meaning (i.e., is not an $\infty - \infty$ form). Also, if
$X_1 \le X_2$, $P$-a.s., then

\[ E[X_1] \le E[X_2]. \]

It follows from this that if $E[X]$ is well defined, then

\[ |E[X]| \le E[|X|]. \]

Given a sequence $\{X_n\}_{n \ge 1}$ of random variables, one seeks conditions guaranteeing
that, provided the limits thereafter exist,

\[ \lim_{n \uparrow \infty} E[X_n] = E\left[\lim_{n \uparrow \infty} X_n\right]. \tag{5.1} \]


The next theorem (monotone convergence theorem) is, again, nothing but a 
rephrasing of the general result of the previous chapter (Theorem 4.3.3) in terms 
of expectations. 


Theorem 5.1.2 Let $\{X_n\}_{n \ge 1}$ and $X$ be real random variables such that
(i) $P(\lim_{n \uparrow \infty} X_n = X) = 1$, and
(ii) $P(0 \le X_n \le X_{n+1}) = 1$ for all $n \ge 1$.

Then (5.1) holds true.


Finally, we have the dominated convergence theorem, which is a rephrasing of 
Theorem 4.3.5 in terms of expectations: 


Theorem 5.1.3 Let $\{X_n\}_{n \ge 1}$ and $X$ be real random variables such that
(i) $P(\lim_{n \uparrow \infty} X_n = X) = 1$, and
(ii) there exists a non-negative real random variable $Z$ with finite expectation
such that $P(|X_n| \le Z) = 1$ for all $n \ge 1$.

Then (5.1) holds true.


5.2 The Distribution of a Random Element 


Definition 5.2.1 Let $X$ be a random element with values in $(E, \mathcal{E})$. Its (probability)
distribution is, by definition, the probability measure $Q_X$ on $(E, \mathcal{E})$, the image
of the probability measure $P$ by the mapping $X$ from $(\Omega, \mathcal{F})$ to $(E, \mathcal{E})$, that is,

\[ Q_X(C) = P(X \in C) \qquad (C \in \mathcal{E}). \]


EXAMPLE 5.2.2: DISTRIBUTION OF $X + a$. Let $X$ be a random vector with
values in $\mathbb{R}^m$ and distribution $Q_X$, and let $a \in \mathbb{R}^m$. The distribution $Q_{X+a}$ of the
random vector $X + a$ is given by

\[ Q_{X+a}(C) = P(X + a \in C) = P(X \in C - a) = Q_X(C - a) \qquad (C \in \mathcal{B}(\mathbb{R}^m)). \]

In particular, for all measurable non-negative functions $f : \mathbb{R}^m \to \mathbb{R}$,

\[ E[f(X + a)] = \int_{\mathbb{R}^m} f(x) \, Q_X(dx - a). \]


As a special case of Theorem 4.4.2, we have: 


Theorem 5.2.3 If $g$ is a measurable function from $(E, \mathcal{E})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, then

\[ E[g(X)] = \int_E g(x) \, Q_X(dx), \]

this formula requiring that one of the sides of the equality be well defined, in which
case the other is also well defined.


In the particular case where $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, taking $C = (-\infty, x]$,

\[ Q_X((-\infty, x]) = P(X \le x) = F_X(x), \]

where $F_X$ is the cumulative distribution function (c.d.f.) of $X$, and

\[ E[g(X)] = \int_{\mathbb{R}} g(x) \, dF_X(x), \]

by definition of the Stieltjes-Lebesgue integral.


In the particular case where $(E, \mathcal{E}) = (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ and the random vector $X$
admits a probability density $f_X$ (that is, if $Q_X$ is the product of the Lebesgue
measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ with the function $f_X$), Theorem 4.4.4 tells us that

\[ E[g(X)] = \int_{\mathbb{R}^n} g(x) f_X(x) \, dx. \]


The following result is an example of the efficiency of Tonelli’s theorem. 


Theorem 5.2.4 (The telescope formula) For any non-negative random variable
$X$ with cumulative distribution function $F$, we have the so-called telescope formula

\[ E[X] = \int_0^{\infty} [1 - F(x)] \, dx. \]

Proof. This follows from Tonelli's theorem applied to the product measure $\ell \times P$.
Indeed,

\[ E[X] = E\left[\int_0^{\infty} 1_{\{X > x\}} \, dx\right] = \int_0^{\infty} E\left[1_{\{X > x\}}\right] dx = \int_0^{\infty} [1 - F(x)] \, dx. \]
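For an integer-valued random variable the telescope formula reduces to $E[X] = \sum_{k \ge 0} P(X > k)$, since $1 - F(x)$ is constant on each interval $[k, k+1)$. The following sketch verifies this with exact rational arithmetic; the distribution `pmf` (a binomial with parameters 3 and 1/2) and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical distribution of a non-negative integer-valued random variable.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def mean(pmf):
    """E[X] computed directly from the probability mass function."""
    return sum(k * p for k, p in pmf.items())

def tail_integral(pmf):
    """Integral over [0, inf) of 1 - F(x), which is piecewise constant on each [k, k+1)."""
    top = max(pmf)
    return sum(sum(p for j, p in pmf.items() if j > k) for k in range(top))

print(mean(pmf), tail_integral(pmf))  # 3/2 3/2
```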


5.3 Characteristic Functions 


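Recall that the characteristic function $\varphi : \mathbb{R}^d \to \mathbb{C}$ of a real random vector $X \in \mathbb{R}^d$
is defined by

\[ \varphi(u) := E\left[e^{i \langle u, X \rangle}\right] \qquad (u \in \mathbb{R}^d). \]


Theorem 5.3.1 Let $X \in \mathbb{R}^d$ be a random vector with characteristic function $\varphi$.
Then (Paul Lévy's formula) for all $1 \le j \le d$, all $a_j, b_j \in \mathbb{R}$ such that $a_j < b_j$,

\[ \lim_{c \uparrow \infty} \frac{1}{(2\pi)^d} \int_{-c}^{+c} \cdots \int_{-c}^{+c} \prod_{j=1}^{d} \left(\frac{e^{-iu_j a_j} - e^{-iu_j b_j}}{iu_j}\right) \varphi(u_1, \ldots, u_d) \, du_1 \cdots du_d
= E\left[\prod_{j=1}^{d} \left(\frac{1}{2} 1_{\{X_j = a_j \text{ or } b_j\}} + 1_{\{a_j < X_j < b_j\}}\right)\right]. \]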


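Proof. We prove this result in the univariate case. The multivariate case is a
straightforward adaptation of it. Let $X$ be a real-valued random variable with
cumulative distribution function $F$ and characteristic function $\varphi$. We show that
for any pair of points $a, b$ ($a < b$),

\[ \lim_{c \uparrow \infty} \frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} \varphi(u) \, du = E\left[\frac{1}{2} 1_{\{X = a \text{ or } b\}} + 1_{\{a < X < b\}}\right]. \tag{$\star$} \]

For this, write

\[ \Phi_c := \frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} \varphi(u) \, du
= \frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} \left(\int_{-\infty}^{+\infty} e^{iux} \, dF(x)\right) du \]
\[ = \int_{-\infty}^{+\infty} \left(\frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} e^{iux} \, du\right) dF(x) = \int_{-\infty}^{+\infty} \Psi_c(x) \, dF(x), \]

where

\[ \Psi_c(x) := \frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} e^{iux} \, du. \]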
TT LU 


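The above computations are justified by Fubini's theorem. The conditions of this
theorem are satisfied since, observing that

\[ \left|\frac{e^{-iua} - e^{-iub}}{iu}\right| = \left|\int_a^b e^{-iux} \, dx\right| \le (b - a), \]

we have

\[ \int_{-c}^{+c} \int_{-\infty}^{+\infty} \left|\frac{e^{-iua} - e^{-iub}}{iu} e^{iux}\right| dF(x) \, du
= \int_{-c}^{+c} \int_{-\infty}^{+\infty} \left|\frac{e^{-iua} - e^{-iub}}{iu}\right| dF(x) \, du \]
\[ \le \int_{-c}^{+c} \int_{-\infty}^{+\infty} (b - a) \, dF(x) \, du = 2c(b - a) < \infty. \]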


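Since the function $u \mapsto \frac{\cos u(x-a) - \cos u(x-b)}{u}$ is antisymmetric,
$\int_{-c}^{+c} \frac{\cos u(x-a) - \cos u(x-b)}{u} \, du = 0$, and therefore

\[ \Psi_c(x) = \frac{1}{2\pi} \int_{-c}^{+c} \frac{\sin u(x-a) - \sin u(x-b)}{u} \, du
= \frac{1}{2\pi} \int_{-c(x-a)}^{c(x-a)} \frac{\sin u}{u} \, du - \frac{1}{2\pi} \int_{-c(x-b)}^{c(x-b)} \frac{\sin u}{u} \, du. \]

The function $c \mapsto \int_0^c \frac{\sin u}{u} \, du = \int_{-c}^0 \frac{\sin u}{u} \, du$ is uniformly continuous in $c$ and tends to
$\int_0^{+\infty} \frac{\sin u}{u} \, du = \frac{\pi}{2}$ as $c \uparrow +\infty$. Therefore the function $(c, x) \mapsto \Psi_c(x)$ is uniformly
bounded. Moreover, in view of the above expression for $\Psi_c$,

\[ \lim_{c \uparrow \infty} \Psi_c(x) := \Psi(x) = \begin{cases} 0 & \text{if } x < a \text{ or } x > b, \\ \frac{1}{2} & \text{if } x = a \text{ or } x = b, \\ 1 & \text{if } a < x < b. \end{cases} \]

Therefore, by dominated convergence,

\[ \lim_{c \uparrow \infty} \Phi_c = \lim_{c \uparrow \infty} \int_{-\infty}^{+\infty} \Psi_c(x) \, dF(x)
= \int_{-\infty}^{+\infty} \Psi(x) \, dF(x) = E\left[\frac{1}{2} 1_{\{X = a \text{ or } b\}} + 1_{\{a < X < b\}}\right]. \]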
Note that, in the univariate case, denoting by $F$ the cumulative distribution
function of the random variable $X$,

\[ E\left[\frac{1}{2} 1_{\{X = a \text{ or } b\}} + 1_{\{a < X < b\}}\right] = \frac{F(b) + F(b-)}{2} - \frac{F(a) + F(a-)}{2}, \]

so that formula ($\star$) takes the perhaps more familiar form

\[ \frac{F(b) + F(b-)}{2} - \frac{F(a) + F(a-)}{2} = \lim_{c \uparrow \infty} \frac{1}{2\pi} \int_{-c}^{+c} \frac{e^{-iua} - e^{-iub}}{iu} \varphi(u) \, du. \]
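The inversion formula can be tested numerically on a distribution whose characteristic function and c.d.f. are both known in closed form. The sketch below uses the standard Gaussian distribution, an assumption made purely for illustration, with $\varphi(u) = e^{-u^2/2}$, and approximates the truncated integral by the midpoint rule; the function names, the cutoff `c`, and the grid size are illustrative choices.

```python
import cmath
import math

def phi_normal(u):
    """Characteristic function of the standard Gaussian distribution."""
    return math.exp(-0.5 * u * u)

def levy_integral(a, b, c=12.0, steps=20_000):
    """Midpoint approximation of (1/2pi) * int_{-c}^{c} (e^{-iua} - e^{-iub})/(iu) phi(u) du."""
    h = 2.0 * c / steps
    total = 0.0 + 0.0j
    for k in range(steps):
        u = -c + (k + 0.5) * h  # midpoints never hit u = 0 when `steps` is even
        kernel = (cmath.exp(-1j * u * a) - cmath.exp(-1j * u * b)) / (1j * u)
        total += kernel * phi_normal(u)
    return (total * h / (2.0 * math.pi)).real

def normal_cdf(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = -1.0, 0.5
print(levy_integral(a, b), normal_cdf(b) - normal_cdf(a))
```

Because the Gaussian c.d.f. is continuous, $F(x-) = F(x)$ and the left-hand side of the formula is simply $F(b) - F(a)$; the two printed values should agree closely.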


Corollary 5.3.2 The distribution of a random vector of $\mathbb{R}^d$ is uniquely determined
by its characteristic function.


Corollary 5.3.3 If the random variable $X$ admits a probability density $f$ and if
moreover its characteristic function $\varphi$ is integrable, then

\[ f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \varphi(u) e^{-iux} \, du. \tag{5.2} \]


Proof. With $f$ defined as in (5.2), we have, by Fubini,

\[ \int_a^b f(x) \, dx = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{e^{-iua} - e^{-iub}}{iu} \varphi(u) \, du = F(b) - F(a), \]

by Paul Lévy's inversion formula. This proves that $f$ is a probability density of
$X$.

The next result says in particular how under certain conditions of integrability, 
the moments of a random variable can be extracted from its characteristic function. 
Theorem 5.3.4 Let $X$ be a real random variable with characteristic function $\psi$,
and suppose that $E[|X|^n] < \infty$ for some integer $n \ge 1$. Then for all integers
$r \le n$, the $r$-th derivative $\psi^{(r)}$ of $\psi$ exists and is given by

\[ \psi^{(r)}(u) = i^r E\left[X^r e^{iuX}\right] \tag{5.3} \]

and in particular $E[X^r] = \frac{\psi^{(r)}(0)}{i^r}$. Moreover,

\[ \psi(u) = \sum_{r=0}^{n} \frac{(iu)^r}{r!} E[X^r] + \frac{(iu)^n}{n!} \varepsilon_n(u), \tag{5.4} \]

where $\lim_{u \to 0} \varepsilon_n(u) = 0$ and $|\varepsilon_n(u)| \le 3E[|X|^n]$.


Proof. First we observe that for any non-negative real number $a$, and all integers
$r \le n$, $a^r \le 1 + a^n$ (indeed, if $a \le 1$, then $a^r \le 1$, and if $a \ge 1$, then $a^r \le a^n$). In
particular,

\[ E[|X|^r] \le E[1 + |X|^n] = 1 + E[|X|^n] < \infty. \]

Suppose that for some $r < n$,

\[ \psi^{(r)}(u) = i^r E\left[X^r e^{iuX}\right]. \]

In

\[ \frac{\psi^{(r)}(u+h) - \psi^{(r)}(u)}{h} = i^r E\left[X^r \frac{e^{i(u+h)X} - e^{iuX}}{h}\right] = i^r E\left[X^r e^{iuX} \frac{e^{ihX} - 1}{h}\right], \]

the quantity under the expectation sign tends to $iX^{r+1} e^{iuX}$ as $h \to 0$, and moreover,
it is bounded in absolute value by an integrable function since

\[ \left|X^r e^{iuX} \frac{e^{ihX} - 1}{h}\right| \le |X|^r \left|\frac{e^{ihX} - 1}{h}\right| \le |X|^{r+1}. \]

(For the last inequality, use the fact that $|e^{ia} - 1|^2 = 2(1 - \cos a) \le a^2$.) Therefore,
by dominated convergence,

\[ \psi^{(r+1)}(u) = \lim_{h \to 0} \frac{\psi^{(r)}(u+h) - \psi^{(r)}(u)}{h} = i^r E\left[\lim_{h \to 0} X^r e^{iuX} \frac{e^{ihX} - 1}{h}\right] = i^{r+1} E\left[X^{r+1} e^{iuX}\right]. \]

Equality (5.3) follows since the induction hypothesis is trivially true for $r = 0$.
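We now prove (5.4). By Taylor's formula, for $y \in \mathbb{R}$,

\[ e^{iy} = \cos y + i \sin y = \sum_{k=0}^{n-1} \frac{(iy)^k}{k!} + \frac{(iy)^n}{n!} \left(\cos(\theta_1 y) + i \sin(\theta_2 y)\right) \]

for some $\theta_1, \theta_2 \in [-1, +1]$. Therefore

\[ e^{iuX} = \sum_{k=0}^{n-1} \frac{(iuX)^k}{k!} + \frac{(iuX)^n}{n!} \left(\cos(\theta_1 uX) + i \sin(\theta_2 uX)\right), \]

where $\theta_1 = \theta_1(\omega)$, $\theta_2 = \theta_2(\omega) \in [-1, +1]$, and

\[ E\left[e^{iuX}\right] = \sum_{r=0}^{n} \frac{(iu)^r}{r!} E[X^r] + \frac{(iu)^n}{n!} \varepsilon_n(u), \]

where

\[ \varepsilon_n(u) = E\left[X^n \left(\cos \theta_1 uX + i \sin \theta_2 uX - 1\right)\right]. \]

Clearly $|\varepsilon_n(u)| \le 3E[|X|^n]$. Also, since the random variable

\[ X^n \left(\cos \theta_1 uX + i \sin \theta_2 uX - 1\right) \]

is bounded in absolute value by the integrable random variable $3|X|^n$ and tends
to 0 as $u \to 0$, we have by dominated convergence $\lim_{u \to 0} \varepsilon_n(u) = 0$.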
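Formula (5.3) at $u = 0$ says that moments can be read off from derivatives of the characteristic function. The sketch below illustrates this with finite differences for the exponential distribution with rate 1, whose characteristic function is $1/(1 - iu)$ and whose first two moments are 1 and 2. The choice of distribution, the step size `h`, and the helper names are assumptions made for this illustration.

```python
def psi_exp(u):
    """Characteristic function of the exponential distribution with rate 1."""
    return 1.0 / (1.0 - 1j * u)

def derivative_at_zero(psi, order, h=1e-3):
    """Central finite-difference approximation of psi^{(order)}(0) for order 1 or 2."""
    if order == 1:
        return (psi(h) - psi(-h)) / (2.0 * h)
    if order == 2:
        return (psi(h) - 2.0 * psi(0.0) + psi(-h)) / (h * h)
    raise ValueError("only orders 1 and 2 are implemented in this sketch")

# E[X^r] = psi^{(r)}(0) / i^r
m1 = (derivative_at_zero(psi_exp, 1) / 1j).real
m2 = (derivative_at_zero(psi_exp, 2) / (1j ** 2)).real
print(m1, m2)  # approximately 1 and 2
```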


Theorem 5.3.4 can be extended to random vectors, with a proof similar to that 
of the univariate case. We just quote the formula giving the mixed moments of a 
random vector in terms of its characteristic function: 


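Let $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ be a random vector with characteristic function

\[ \varphi(u) = E\left[e^{i \langle u, X \rangle}\right] \qquad (u = (u_1, \ldots, u_d)). \]


Theorem 5.3.5 Suppose that $E[|X_i|^n] < \infty$ ($1 \le i \le d$) for some $n \ge 1$. Then
for all $\nu = (\nu_1, \ldots, \nu_d)$ such that $\nu_1 + \cdots + \nu_d \le n$, the partial derivative

\[ \frac{\partial^{\nu_1 + \cdots + \nu_d} \varphi}{\partial u_1^{\nu_1} \cdots \partial u_d^{\nu_d}} \]

exists and is continuous, and

\[ E\left[X_1^{\nu_1} \cdots X_d^{\nu_d}\right] = (-i)^{\nu_1 + \cdots + \nu_d} \frac{\partial^{\nu_1 + \cdots + \nu_d} \varphi}{\partial u_1^{\nu_1} \cdots \partial u_d^{\nu_d}}(0, \ldots, 0). \tag{5.5} \]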


The proof is required in Exercise 5.7.11. 
5.4 Independence
This section revisits the notion of independence in a more rigorous and more general way
than was done in the first three chapters.


Recall that two events $A$ and $B$ are said to be independent if

\[ P(A \cap B) = P(A) P(B). \]

More generally, a family $\{A_i\}_{i \in I}$ of events, where $I$ is an arbitrary index set, is called
an independent family if, for every finite subset $J \subseteq I$,

\[ P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j). \]


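Two random elements $X : (\Omega, \mathcal{F}) \to (E, \mathcal{E})$ and $Y : (\Omega, \mathcal{F}) \to (G, \mathcal{G})$ are called
independent if for all $C \in \mathcal{E}$, $D \in \mathcal{G}$,

\[ P(\{X \in C\} \cap \{Y \in D\}) = P(X \in C) P(Y \in D). \]

More generally, a family $\{X_i\}_{i \in I}$ (where $I$ is an arbitrary index set) of random elements
$X_i : (\Omega, \mathcal{F}) \to (E_i, \mathcal{E}_i)$ ($i \in I$) is said to be independent if, for every finite subset
$J \subseteq I$,

\[ P\left(\bigcap_{j \in J} \{X_j \in C_j\}\right) = \prod_{j \in J} P(X_j \in C_j) \]

for all $C_j \in \mathcal{E}_j$ ($j \in J$).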


The σ-fields $\mathcal{F}_1$ and $\mathcal{F}_2$ on $\Omega$ are called independent if for any $k_1, k_2 \ge 1$, any
$A_1, \ldots, A_{k_1} \in \mathcal{F}_1$, and any $B_1, \ldots, B_{k_2} \in \mathcal{F}_2$,

\[ P(A_1, \ldots, A_{k_1}, B_1, \ldots, B_{k_2}) = P(A_1, \ldots, A_{k_1}) P(B_1, \ldots, B_{k_2}). \]

This definition extends in an obvious way to the independence of a finite number
of σ-fields.


By definition, the σ-field generated by a random element $X$ with values in the
measurable space $(E, \mathcal{E})$ is the σ-field $\sigma(X)$ generated by the collection of events
$\{X \in C\}$ ($C \in \mathcal{E}$).


Therefore a family $\{X_i\}_{i \in I}$, where $I$ is an arbitrary index set, of random elements
$X_i : (\Omega, \mathcal{F}) \to (E_i, \mathcal{E}_i)$ ($i \in I$), is an independent family of random elements if for
every finite subset $J \subseteq I$, the family of σ-fields $\{\sigma(X_j)\}_{j \in J}$ is independent.

The next result says that the independence property is preserved when taking
functions of the random elements and is an immediate consequence of the definition
of independent random elements. This "natural" result was used without further
justification in the first three chapters.


Theorem 5.4.1 If the random elements $X$ and $Y$, taking their values in $(E, \mathcal{E})$
and $(G, \mathcal{G})$ respectively, are independent, then so are the random elements $\varphi(X)$
and $\psi(Y)$, where $\varphi : (E, \mathcal{E}) \to (E', \mathcal{E}')$, $\psi : (G, \mathcal{G}) \to (G', \mathcal{G}')$.


Proof. For all $C' \in \mathcal{E}'$, $D' \in \mathcal{G}'$, the sets $C = \varphi^{-1}(C')$ and $D = \psi^{-1}(D')$ are in $\mathcal{E}$
and $\mathcal{G}$ respectively, since $\varphi$ and $\psi$ are measurable. We have

\[ P(\varphi(X) \in C', \psi(Y) \in D') = P(X \in C, Y \in D) = P(X \in C) P(Y \in D) = P(\varphi(X) \in C') P(\psi(Y) \in D'). \]


The above result is stated for two random elements for simplicity, and it extends 
in the obvious way to a finite number of independent random elements. 


In order to prove that two o-fields are independent, it suffices to prove that 
certain subclasses of these o-fields are independent. 


More precisely: 


Theorem 5.4.2 Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two
collections of events that are stable under finite intersections. If $\mathcal{S}_1$ and $\mathcal{S}_2$ are
independent, then so are $\sigma(\mathcal{S}_1)$ and $\sigma(\mathcal{S}_2)$.


This is not proved in this book.¹


The next corollary brings us back to the elementary definition of independence
of two random variables.


Corollary 5.4.3 Let $(\Omega, \mathcal{F}, P)$ be a probability space on which are given two real
random variables $X$ and $Y$. For these two random variables to be independent, it is
necessary and sufficient that for all $a, b \in \mathbb{R}$, $P(X \le a, Y \le b) =
P(X \le a) P(Y \le b)$.


Proof. This follows from Theorem 5.4.2, since the collection $\{(-\infty, a];\ a \in \mathbb{R}\}$ is
stable under finite intersection and generates $\mathcal{B}(\mathbb{R})$.


Similar arguments lead to a similar result for several random variables.


¹ See for instance Theorem 3.1.39 of [7].
The Product Formula


The independence of two random variables $X$ and $Y$ is equivalent to the factorisation
of their joint distribution:

\[ Q_{(X,Y)} = Q_X \times Q_Y, \]

where $Q_{(X,Y)}$, $Q_X$, and $Q_Y$ are the distributions of $(X, Y)$, $X$, and $Y$, respectively.
Indeed, for all sets of the form $C \times D$, where $C \in \mathcal{E}$, $D \in \mathcal{G}$,

\[ Q_{(X,Y)}(C \times D) = P((X, Y) \in C \times D) = P(X \in C, Y \in D) = P(X \in C) P(Y \in D) = Q_X(C) Q_Y(D). \]

In particular, by the Fubini-Tonelli theorem,

Theorem 5.4.4 Let $X$ and $Y$ be independent random elements taking their values
in $(E, \mathcal{E})$ and $(G, \mathcal{G})$ respectively. Then for all $g : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
$h : (G, \mathcal{G}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $E[|g(X)|] < \infty$ and $E[|h(Y)|] < \infty$, or $g \ge 0$
and $h \ge 0$, we have the product formula for expectations

\[ E[g(X) h(Y)] = E[g(X)] E[h(Y)]. \]
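For discrete random variables the product formula can be verified exactly, since the expectation under the product measure $Q_X \times Q_Y$ is a finite double sum. In the sketch below, the two marginal laws and the functions `g` and `h` are hypothetical choices made for illustration only.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal laws of two independent discrete random variables.
law_x = {-1: Fraction(1, 4), 0: Fraction(1, 4), 2: Fraction(1, 2)}
law_y = {1: Fraction(1, 3), 3: Fraction(2, 3)}

def expectation(law, func):
    """E[func(V)] for a discrete random variable V with the given law."""
    return sum(func(v) * p for v, p in law.items())

def joint_expectation(law_x, law_y, func):
    """Expectation of func(X, Y) under the product measure Q_X x Q_Y (independence)."""
    return sum(func(x, y) * px * py
               for (x, px), (y, py) in product(law_x.items(), law_y.items()))

g = lambda x: x * x
h = lambda y: 1 + y

lhs = joint_expectation(law_x, law_y, lambda x, y: g(x) * h(y))
rhs = expectation(law_x, g) * expectation(law_y, h)
print(lhs, rhs)  # 15/2 15/2
```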


EXAMPLE 5.4.5: CONVOLUTION PRODUCT. Let $X$ and $Y$ be two independent
random vectors with values in $\mathbb{R}^m$ and with respective distributions $Q_X$ and $Q_Y$.
We compute the distribution of the random vector $Z := X + Y$:

\[ P(Z \in C) = P(X + Y \in C) = E[1_C(X + Y)] = \int_{\mathbb{R}^m} \int_{\mathbb{R}^m} 1_C(x + y) \, Q_X(dx) \, Q_Y(dy) \]
\[ = \int_{\mathbb{R}^m} \left(\int_{\mathbb{R}^m} 1_C(x + y) \, Q_X(dx)\right) Q_Y(dy) = \int_{\mathbb{R}^m} \left(\int_{\mathbb{R}^m} 1_{C-y}(x) \, Q_X(dx)\right) Q_Y(dy) \]
\[ = \int_{\mathbb{R}^m} Q_X(C - y) \, Q_Y(dy), \]

that is,

\[ Q_Z(C) = \int_{\mathbb{R}^m} Q_X(C - y) \, Q_Y(dy). \]

This probability distribution is called the convolution product of $Q_X$ and $Q_Y$.
In the scalar case $m = 1$, and with $C := (-\infty, z]$, we have the following version
of the convolution product formula in terms of cumulative distribution functions
and Stieltjes-Lebesgue integrals:

\[ F_Z(z) = \int_{\mathbb{R}} F_X(z - y) \, dF_Y(y). \]

The next result generalizes Theorem 3.2.20, with a similar proof. 


Theorem 5.4.6 For the random vectors $X_1, \ldots, X_d$ to be independent, a necessary
and sufficient condition is that the characteristic function $\varphi_X$ of $X =
(X_1, \ldots, X_d)$ factorizes as

\[ \varphi_X(u_1, \ldots, u_d) = \prod_{j=1}^{d} \varphi_j(u_j), \]

where for all $1 \le j \le d$, $\varphi_j$ is a characteristic function. In this case, for all
$1 \le j \le d$, $\varphi_j = \varphi_{X_j}$, the characteristic function of $X_j$.


Proof. Exercise 5.7.7. 


5.5 Conditional Expectation III 


In Chapters 2 and 3, the theory of conditional expectation was developed in the 
discrete and the absolutely continuous cases respectively. This chapter now gives 
the theory of conditional expectation of a random variable with respect to another 
random variable when the joint distribution is arbitrary. 


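We start with a preliminary observation. Let $X$ and $Y$ be two random vectors
of dimensions $p$ and $n$ respectively, with the joint probability density $f_{X,Y}(x, y)$.
Let the function $g : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}$ be either non-negative or such that $g(X, Y)$
is integrable. For any non-negative bounded function $\varphi : \mathbb{R}^n \to \mathbb{R}$, we have

\[ E[E^Y[g(X, Y)] \, \varphi(Y)] = E[g(X, Y) \varphi(Y)]. \qquad (5.6) \]

Indeed, writing $E^Y[g(X,Y)] = \psi(Y)$,
\begin{align*}
E[E^Y[g(X,Y)] \, \varphi(Y)] &= E[\psi(Y) \varphi(Y)] = \int_{\mathbb{R}^n} \psi(y) \varphi(y) f_Y(y) \, dy \\
&= \int_{\mathbb{R}^n} \left( \int_{\mathbb{R}^p} g(x, y) f_X^{Y=y}(x) \, dx \right) \varphi(y) f_Y(y) \, dy \\
&= \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} g(x, y) \varphi(y) f_{X,Y}(x, y) \, dx \, dy \\
&= E[g(X, Y) \varphi(Y)].
\end{align*}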
In the discrete case, a similar computation yields a similar result. This suggests
adopting (5.6), in the general case (not necessarily discrete or absolutely continuous),
as a definition of conditional expectation, where $X$ and $Y$ are random elements
taking their values in spaces $E$ and $F$ respectively that can be either discrete
or some $\mathbb{R}^k$. This would include mixed cases of the type discrete/absolutely
continuous, but also more complex situations, such as the following: $E = \mathbb{R}^p$ and
$F = \mathbb{R}^n$, $X$ and $Y$ admit a probability density function, but the couple $(X,Y)$
does not admit a probability density function (for instance, in the univariate case,
if $Y = X^2$).


We shall now give more generality to the study (but only in appearance) and
take for the conditioned variable a real random variable $Z$ (previously, we chose
$Z = g(X,Y)$).


Definition 5.5.1 A $Y$-measurable random variable is a random variable $U$ of the
form $U = \varphi(Y)$ where $\varphi : \mathbb{R}^n \to \mathbb{R}$ is a measurable function.


Definition 5.5.2 Let $Z$ and $Y$ be as above, and suppose that $Z$ is either
non-negative or integrable. The conditional expectation $E^Y[Z]$ is by definition the
“essentially unique” variable of the form $\psi(Y)$, where $\psi$ is measurable, such that
the equality

\[ E[\psi(Y) U] = E[Z U] \qquad (5.7) \]

holds for any non-negative bounded $Y$-measurable real random variable $U = \varphi(Y)$.


By “essentially unique” the following is meant: if there are two functions $\psi_1$
and $\psi_2$ that meet the requirement, then $\psi_1(Y) = \psi_2(Y)$ almost surely, that is,
$P(\psi_1(Y) = \psi_2(Y)) = 1$. We then say that $\psi_1(Y)$ and $\psi_2(Y)$ are two “versions” of
$E^Y[Z]$.

Theorem 5.5.3 In the situation described in the above definition, the conditional
expectation exists and is essentially unique.


Proof. The proof of existence is omitted at this point (see Section 5.6). In practice,
one is usually able to find “a” function $\psi$ by construction, as the examples and the
exercises will show. The uniqueness part, which we now prove, guarantees
that it is “the” function $\psi$.
Indeed, suppose that $\psi_1$ and $\psi_2$ meet the requirement. In particular,
\[ E[\psi_1(Y) \varphi(Y)] = E[\psi_2(Y) \varphi(Y)] \quad (= E[Z \varphi(Y)]) \]

and therefore

\[ E[(\psi_1(Y) - \psi_2(Y)) \varphi(Y)] = 0 \]
for all non-negative bounded measurable functions $\varphi : \mathbb{R}^n \to \mathbb{R}$. Choosing
$\varphi(Y) = 1_{\{\psi_1(Y) - \psi_2(Y) > 0\}}$, we obtain

\[ E[(\psi_1(Y) - \psi_2(Y)) 1_{\{\psi_1(Y) - \psi_2(Y) > 0\}}] = 0. \]

Since the random variable $(\psi_1(Y) - \psi_2(Y)) 1_{\{\psi_1(Y) - \psi_2(Y) > 0\}}$ is non-negative and has
a null expectation, it must be almost surely null (Lemma 3.3.3). In other terms
$\psi_1(Y) - \psi_2(Y) \le 0$ almost surely. Exchanging the roles of $\psi_1$ and $\psi_2$, we have that
$\psi_1(Y) - \psi_2(Y) \ge 0$ almost surely. Therefore $\psi_1(Y) - \psi_2(Y) = 0$ almost surely.


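EXAMPLE 5.5.4: THE DISCRETE CASE REVISITED. If $Y$ is a positive integer-valued
random variable, then

\[ E^Y[Z] = \sum_{n \ge 1} \frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} 1_{\{Y=n\}}, \]
where, by convention, $\frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} = 0$ when $P(Y = n) = 0$.


Proof. We must verify (5.7) for all bounded measurable $\varphi : \mathbb{R} \to \mathbb{R}$. The left-hand
side is equal to

\begin{align*}
& E\left[ \left( \sum_{n \ge 1} \frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} 1_{\{Y=n\}} \right) \left( \sum_{k \ge 1} \varphi(k) 1_{\{Y=k\}} \right) \right] \\
&\quad = E\left[ \sum_{n \ge 1} \frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} \varphi(n) 1_{\{Y=n\}} \right] = \sum_{n \ge 1} \frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} \varphi(n) E[1_{\{Y=n\}}] \\
&\quad = \sum_{n \ge 1} \frac{E[Z 1_{\{Y=n\}}]}{P(Y=n)} \varphi(n) P(Y = n) = \sum_{n \ge 1} E[Z 1_{\{Y=n\}}] \varphi(n) = E[Z \varphi(Y)].
\end{align*}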
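The verification above can be replayed numerically on a small finite model. The sketch below is an illustration only: the joint law `joint` and the test function `phi` are hypothetical choices of ours, and the helper names are not from the text. It computes $\psi(n) = E[Z 1_{\{Y=n\}}]/P(Y=n)$ and checks the defining equality (5.7) in exact arithmetic.

```python
from fractions import Fraction

# Hypothetical joint law of (Y, Z): keys are (y, z), values are probabilities.
joint = {
    (1, 0): Fraction(1, 8), (1, 4): Fraction(1, 8),
    (2, 1): Fraction(1, 4), (2, 3): Fraction(1, 4),
    (3, 5): Fraction(1, 4),
}


def psi(n):
    """psi(n) = E[Z 1_{Y=n}] / P(Y=n), with the convention 0 when P(Y=n) = 0."""
    p_n = sum(p for (y, _), p in joint.items() if y == n)
    if p_n == 0:
        return Fraction(0)
    return sum(z * p for (y, z), p in joint.items() if y == n) / p_n


def expect(f):
    """E[f(Y, Z)] under the joint law."""
    return sum(f(y, z) * p for (y, z), p in joint.items())


def phi(y):
    """An arbitrary test function, non-negative and bounded on the support of Y."""
    return y * y - 1


lhs = expect(lambda y, z: psi(y) * phi(y))   # E[psi(Y) phi(Y)]
rhs = expect(lambda y, z: z * phi(y))        # E[Z phi(Y)]
```

Here `psi(4)` returns 0 because $P(Y = 4) = 0$ in this model, which is the convention stated in the example.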


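EXAMPLE 5.5.5: THE ABSOLUTELY CONTINUOUS CASE REVISITED. Let $X$
and $Y$ be random vectors of dimensions $p$ and $n$ respectively, admitting the joint
probability density $f_{X,Y}(x,y)$. Let $g : \mathbb{R}^{p+n} \to \mathbb{R}$ be a measurable function, and
suppose that the random variable $Z = g(X,Y)$ is integrable. The conditional
expectation of $Z$ given $Y$ is the random variable $\psi(Y)$, where

\[ \psi(y) = \int_{\mathbb{R}^p} g(x,y) f_X^{Y=y}(x) \, dx. \]

Proof. We first verify that $\psi(Y)$ is integrable. We have

\[ |\psi(y)| \le \int_{\mathbb{R}^p} |g(x,y)| f_X^{Y=y}(x) \, dx \]

and therefore

\begin{align*}
E[|\psi(Y)|] &= \int_{\mathbb{R}^n} |\psi(y)| f_Y(y) \, dy \le \int_{\mathbb{R}^n} \left( \int_{\mathbb{R}^p} |g(x,y)| f_X^{Y=y}(x) \, dx \right) f_Y(y) \, dy \\
&= \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} |g(x,y)| f_X^{Y=y}(x) f_Y(y) \, dx \, dy \\
&= \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} |g(x,y)| f_{X,Y}(x,y) \, dx \, dy = E[|g(X,Y)|] = E[|Z|] < \infty.
\end{align*}
We check that (5.7) is true, with $U = \varphi(Y)$ bounded. The left-hand side is
\begin{align*}
E[\psi(Y) \varphi(Y)] &= \int_{\mathbb{R}^n} \psi(y) \varphi(y) f_Y(y) \, dy \\
&= \int_{\mathbb{R}^n} \left( \int_{\mathbb{R}^p} g(x,y) f_X^{Y=y}(x) \, dx \right) \varphi(y) f_Y(y) \, dy \\
&= \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} g(x,y) \varphi(y) f_X^{Y=y}(x) f_Y(y) \, dx \, dy \\
&= \int_{\mathbb{R}^p} \int_{\mathbb{R}^n} g(x,y) \varphi(y) f_{X,Y}(x,y) \, dx \, dy \\
&= E[g(X,Y) \varphi(Y)] = E[Z \varphi(Y)].
\end{align*}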


EXAMPLE 5.5.6: MIXED CASE, I. We shall consider the situation, often encountered
in practice, where $X$ is a random vector of dimension $p$ and where $Y$ takes
its values in $\mathbb{N}_+$. We denote $P(Y = k)$ by $\pi(k)$. We suppose that for all $k \ge 1$,
there is a probability density function $f_k$ such that

\[ P(X \in A \mid Y = k) = \int_A f_k(x) \, dx \qquad (A \in \mathcal{B}(\mathbb{R}^p)). \]

Then, for any function $g : \mathbb{R}^p \times \mathbb{N}_+ \to \mathbb{R}$ that is non-negative or such that $g(X, Y)$
is integrable, we have

\[ E^Y[g(X,Y)] = \psi(Y), \]
where

\[ \psi(k) = \int_{\mathbb{R}^p} g(x,k) f_k(x) \, dx. \]
The proof is similar to the proof when $(X,Y)$ has a joint probability density
and is left to the reader.


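EXAMPLE 5.5.7: MIXED CASE, II. We now treat the second type of mixed
case, where the conditioning variable $Y$ is a random vector of dimension $n$, $X$ is an
$\mathbb{N}_+$-valued random variable, with the joint distribution of $(X,Y)$ given by

\[ P(X = k) = \pi(k) \qquad (k \ge 1) \]

and

\[ P(Y \in A \mid X = k) = \int_A f_k(y) \, dy \qquad (k \ge 1, \; A \in \mathcal{B}(\mathbb{R}^n)). \]
For all $k \ge 1$, $y \in \mathbb{R}^n$, let

\[ \pi_{X|Y}(k \mid y) = \frac{\pi(k) f_k(y)}{f_Y(y)} \]
if $f_Y(y) = \sum_{k \ge 1} \pi(k) f_k(y) > 0$, and $\pi_{X|Y}(k \mid y) = 0$ otherwise. We let the reader verify
that for all $g : \mathbb{N}_+ \times \mathbb{R}^n \to \mathbb{R}$ such that $E[|g(X,Y)|] < \infty$,
\[ E^Y[g(X,Y)] = \psi(Y), \]

where

\[ \psi(y) = \sum_{k \ge 1} g(k, y) \, \pi_{X|Y}(k \mid y). \]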
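The weights $\pi_{X|Y}(k \mid y)$ of this example are straightforward to evaluate once the prior and the densities $f_k$ are given. The Python sketch below assumes a purely hypothetical model (three classes with unit-variance Gaussian densities; the names `prior`, `means`, `posterior` and `psi` are ours) and evaluates $\psi(y) = \sum_k g(k,y) \pi_{X|Y}(k \mid y)$ at a given point.

```python
import math


def normal_pdf(y, mean, std):
    """Density of the N(mean, std^2) distribution at y."""
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


# Hypothetical model: X takes values in {1, 2, 3} with prior pi,
# and, given X = k, Y is N(means[k], 1).
prior = {1: 0.5, 2: 0.3, 3: 0.2}
means = {1: -2.0, 2: 0.0, 3: 3.0}


def f_k(k, y):
    """Conditional density of Y given X = k."""
    return normal_pdf(y, means[k], 1.0)


def posterior(y):
    """pi_{X|Y}(k | y) = pi(k) f_k(y) / f_Y(y), and 0 when f_Y(y) = 0."""
    f_y = sum(prior[k] * f_k(k, y) for k in prior)
    if f_y == 0:
        return {k: 0.0 for k in prior}
    return {k: prior[k] * f_k(k, y) / f_y for k in prior}


def psi(g, y):
    """psi(y) = sum_k g(k, y) pi_{X|Y}(k | y)."""
    post = posterior(y)
    return sum(g(k, y) * post[k] for k in prior)


post_at_zero = posterior(0.0)
cond_mean_x = psi(lambda k, y: k, 0.0)   # value of E[X | Y = y] at y = 0
```

With these assumed parameters, the observation $y = 0$ favours the class $k = 2$, whose density is centred at 0.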


We now list, in the more general setting, the main rules that are useful in 
computing conditional expectations. 


Let $Y$ be a random variable, and let $Z$, $Z_1$, $Z_2$ be integrable (resp. non-negative
finite) random variables, $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp. $\in \mathbb{R}_+$).


Theorem 5.5.8 Rule 1 (linearity)

\[ E^Y[\lambda_1 Z_1 + \lambda_2 Z_2] = \lambda_1 E^Y[Z_1] + \lambda_2 E^Y[Z_2]. \]

Proof. We consider only the integrable case. The non-negative case follows,
mutatis mutandis. We must check that $\lambda_1 E^Y[Z_1] + \lambda_2 E^Y[Z_2]$ is $Y$-measurable
(which is part of the definition of a conditional expectation with respect to $Y$) and
that for all bounded $Y$-measurable random variables $U$

\[ E[(\lambda_1 E^Y[Z_1] + \lambda_2 E^Y[Z_2]) U] = E[(\lambda_1 Z_1 + \lambda_2 Z_2) U]. \]
This follows immediately from the definition of $E^Y[Z_i]$, which says that
$E[E^Y[Z_i] U] = E[Z_i U]$ ($i = 1, 2$).


Theorem 5.5.9 Rule 2. If $Z$ is independent of $Y$, then

\[ E^Y[Z] = E[Z]. \]

Proof. (Non-negative case.) The constant $E[Z]$ (as any constant) is $Y$-measurable.
Moreover, for all bounded $Y$-measurable random variables $U$, $E[E[Z] U] = E[ZU]$.
In fact, $Z$ and $U$ are independent and therefore $E[ZU] = E[Z] E[U]$.


Theorem 5.5.10 Rule 3. If $Z$ is $Y$-measurable, then

\[ E^Y[Z] = Z. \]

Proof. (Non-negative case.) In fact, $Z$ is $Y$-measurable by hypothesis and
$E[ZU] = E[ZU]$!


Theorem 5.5.11 Rule 4. If $Z_1 \le Z_2$ P-a.s., then
\[ E^Y[Z_1] \le E^Y[Z_2] \quad P\text{-a.s.} \]

In particular, if $Z$ is a non-negative random variable, $E^Y[Z] \ge 0$, P-a.s.


Proof. We consider only the integrable case. For any bounded non-negative
$Y$-measurable random variable $U$,

\[ E[E^Y[Z_1] U] = E[Z_1 U] \le E[Z_2 U] = E[E^Y[Z_2] U]. \]
Therefore
\[ E[(E^Y[Z_2] - E^Y[Z_1]) U] \ge 0. \]

In particular,
\[ E[(E^Y[Z_2] - E^Y[Z_1]) 1_{\{E^Y[Z_2] < E^Y[Z_1]\}}] \ge 0. \]

Since the left-hand side is non-positive, it follows that $P(E^Y[Z_2] < E^Y[Z_1]) = 0$.


Theorem 5.5.12 Rule 5. Let $Y_1$ and $Y_2$ be random variables, and let $Z$ be either
integrable or non-negative. Then

\[ E^{Y_2}[E^{Y_1,Y_2}[Z]] = E^{Y_2}[Z]. \]

Proof. We just have to check that $E^{Y_2}[E^{Y_1,Y_2}[Z]]$ is a version of $E^{Y_2}[Z]$. Since it
is a $Y_2$-measurable variable it remains to show that it satisfies

\[ E[E^{Y_2}[E^{Y_1,Y_2}[Z]] U] = E[ZU], \]
for any bounded (resp. bounded non-negative) $Y_2$-measurable variable $U$. Since
such a variable is a fortiori $(Y_1, Y_2)$-measurable,

\[ E[E^{Y_1,Y_2}[Z] U] = E[ZU]. \]

Moreover

\[ E[E^{Y_2}[E^{Y_1,Y_2}[Z]] U] = E[E^{Y_1,Y_2}[Z] U], \]
by definition of $E^{Y_2}[E^{Y_1,Y_2}[Z]]$.
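Rule 5 can be checked mechanically on a finite probability space, where a conditional expectation is simply an average over the level sets of the conditioning variable. The following sketch is a toy illustration under assumed data: the uniform sample space, the variable `Z` and the helper `cond_exp` are our own hypothetical choices.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite model: omega = (y1, y2, z), uniform on {0,1} x {0,1} x {0,1,2}.
omegas = list(product([0, 1], [0, 1], [0, 1, 2]))
prob = {w: Fraction(1, len(omegas)) for w in omegas}


def Z(w):
    """An arbitrary random variable on the finite sample space."""
    y1, y2, z = w
    return z * (1 + y1) + y2


def cond_exp(f, key):
    """Return w -> E[f | key](w), obtained by averaging f over {key = key(w)}."""
    mass, total = {}, {}
    for w, p in prob.items():
        k = key(w)
        mass[k] = mass.get(k, 0) + p
        total[k] = total.get(k, 0) + f(w) * p
    return lambda w: total[key(w)] / mass[key(w)]


inner = cond_exp(Z, key=lambda w: (w[0], w[1]))   # E^{Y1,Y2}[Z]
outer = cond_exp(inner, key=lambda w: w[1])       # E^{Y2}[E^{Y1,Y2}[Z]]
direct = cond_exp(Z, key=lambda w: w[1])          # E^{Y2}[Z]
```

In this model every atom has positive probability, so the versions computed by `cond_exp` agree everywhere and not merely almost surely.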


Theorem 5.5.13 Let $Y$ be a random vector and let $Z$ be of the form $Z = VZ'$,
where $V$ is a $Y$-measurable bounded (resp. non-negative finite) random variable,
and $Z'$ is an integrable (resp. non-negative finite) random variable. Then

\[ E^Y[VZ'] = V E^Y[Z']. \]

Proof. We consider only the integrable case. We observe that $V E^Y[Z']$ is
$Y$-measurable, and it remains to prove that for all bounded $Y$-measurable random
variables $U$,

\[ E[VZ'U] = E[V E^Y[Z'] U]. \]
But, since $VU$ is bounded, by definition of $E^Y[Z']$,
\[ E[V E^Y[Z'] U] = E[VZ'U]. \]


The theorems allowing the interchange of limit and integral (monotone convergence
theorem and dominated convergence theorem) have conditional versions.


We start with the monotone convergence theorem: 


Theorem 5.5.14 Let $X$ be some random vector and let $\{Y_n\}_{n \ge 1}$ be a P-a.s.
non-decreasing sequence of non-negative random variables converging P-a.s. to the
random variable $Y$. Then $\{E^X[Y_n]\}_{n \ge 1}$ is a P-a.s. non-decreasing sequence of random
variables converging P-a.s. to $E^X[Y]$.


Proof. By monotonicity of conditional expectation, $\{E^X[Y_n]\}_{n \ge 1}$ is a P-a.s.
non-decreasing sequence of $X$-measurable random variables. In particular, there exists
a P-a.s. limit $W$, $X$-measurable, of this sequence. By monotone convergence, for
any bounded non-negative $X$-measurable random variable $U$,

\[ \lim_{n \uparrow \infty} E[Y_n U] = E[YU], \]

and

\[ \lim_{n \uparrow \infty} E[E^X[Y_n] U] = E[WU]. \]

Therefore, since $E[Y_n U] = E[E^X[Y_n] U]$ for all $n \ge 1$, $E[YU] = E[WU]$. This
being true for all bounded non-negative $X$-measurable random variables $U$, $W = E[Y \mid X]$.


We now turn to the conditioned version of the dominated convergence theorem: 


Theorem 5.5.15 Let $X$ be some random vector, and let $\{Y_n\}_{n \ge 1}$ be a sequence
of random variables converging P-a.s. to the random variable $Y$, and such that
$|Y_n| \le Z$ for some integrable random variable $Z$. Then $\{E^X[Y_n]\}_{n \ge 1}$ converges
P-a.s. to $E^X[Y]$.


Proof. Let $W_n := \sup_{m \ge n} |Y_m - Y|$. The sequence $\{W_n\}_{n \ge 1}$ decreases P-a.s. to
0. We have

\begin{align*}
|E^X[Y_n] - E^X[Y]| &= |E^X[Y_n - Y]| \\
&\le E^X[|Y_n - Y|] \le E^X[W_n].
\end{align*}

The non-negative sequence $\{E^X[W_n]\}_{n \ge 1}$ decreases P-a.s. (rule 4). Let $H \ge 0$ be
its limit. Then
\[ 0 \le E[H] \le E[E^X[W_n]] = E[W_n], \]

where the latter quantity tends to 0 by dominated convergence (because $0 \le W_n \le 2Z$).
Therefore $E[H] = 0$, which implies that $P(H = 0) = 1$ since $H$ is P-a.s.
non-negative.


5.6 General Theory of Conditional Expectation 


We shall need later a more general and abstract theory of conditional expectation.
Previously, the conditioning was with respect to random variables or vectors. We
now condition with respect to $\sigma$-fields.


Definition 5.6.1 Let $Y$ be an integrable (resp. finite non-negative) random variable,
and let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$. A version of the conditional expectation of $Y$
given $\mathcal{G}$ is any integrable (resp. finite non-negative) $\mathcal{G}$-measurable random variable
$Z$ such that

\[ E[YU] = E[ZU] \qquad (5.8) \]

for all bounded (resp. bounded non-negative) $\mathcal{G}$-measurable random variables $U$.


Theorem 5.6.2 Let $Y$ and $\mathcal{G}$ be as above. There exists at least one version of the
conditional expectation of $Y$ given $\mathcal{G}$, and it is essentially unique, that is, if $Z'$ is
another version of the conditional expectation of $Y$ given $\mathcal{G}$, then $Z = Z'$, P-a.s.

There will be no problem in representing two versions of this conditional expectation
by the same symbol, since they are P-almost surely equal.
We choose the symbol $E[Y \mid \mathcal{G}]$ or $E^{\mathcal{G}}[Y]$ indifferently. From now on we say: $E^{\mathcal{G}}[Y]$
(or $E[Y \mid \mathcal{G}]$) is the conditional expectation of $Y$ given $\mathcal{G}$. The defining equality
(5.8) reads

\[ E[YU] = E[E^{\mathcal{G}}[Y] U] \]

for all bounded (resp. bounded non-negative) $\mathcal{G}$-measurable random variables $U$.


Proof. Uniqueness. The integrable case will be treated, the other case being
similar. First observe that

\[ 0 = E[ZU] - E[Z'U] = E[(Z - Z')U] \]
for all bounded $\mathcal{G}$-measurable random variables $U$. In particular, with $U = 1_{\{Z > Z'\}}$,
\[ E[(Z - Z') 1_{\{Z > Z'\}}] = 0. \]

Since the random variable in the expectation is non-negative, it can have a null
expectation only if it is P-a.s. null, that is if P-a.s., $Z \le Z'$. By symmetry, P-a.s.,
$Z \ge Z'$, and therefore, as announced, $Z = Z'$, P-a.s.


Existence. We do this for the non-negative case, the general case
following easily from this special case. Consider the measure $\nu$ on $(\Omega, \mathcal{G})$ defined
by

\[ \nu(A) = \int_A Y \, dP \qquad (A \in \mathcal{G}). \]

It is finite (resp. $\sigma$-finite) since $Y$ is assumed integrable (resp. finite non-negative).
Moreover, if $P(A) = 0$ then $\nu(A) = 0$. Therefore $\nu$ is absolutely continuous with
respect to the measure $\mu$ on $(\Omega, \mathcal{G})$ that is the restriction of $P$ to $(\Omega, \mathcal{G})$, so that, by
the Radon-Nikodym theorem (Theorem 4.4.6), there exists an integrable (resp. finite
non-negative) random variable of $(\Omega, \mathcal{G})$, that is, an integrable (resp. finite
non-negative) $\mathcal{G}$-measurable random variable $Z$ of $(\Omega, \mathcal{F})$, such that

\[ \nu(A) = \int_A Z \, dP \qquad (A \in \mathcal{G}). \]

In particular,

\[ \int_\Omega U Y \, dP = \int_\Omega U Z \, dP \]

for all bounded (resp. non-negative bounded) $\mathcal{G}$-measurable random variables $U$.
A Special Case


Let
\[ \mathcal{G} := \sigma(X), \qquad (5.9) \]

where $X = (X_1, \ldots, X_N)$ is an arbitrary random vector defined on $(\Omega, \mathcal{F})$ and
$\sigma(X)$ is, by definition, the smallest $\sigma$-field that contains all the sets of the form
$\{X \in C\}$ where $C \in \mathcal{B}(\mathbb{R}^N)$. In this situation, we adopt the notation $E^X[Y]$
for $E^{\mathcal{G}}[Y]$ (or $E[Y \mid X]$ for $E[Y \mid \mathcal{G}]$), and we call this (equivalence class of) random
variable(s) the conditional expectation of $Y$ given $X$.


Theorem 5.6.3 Let $X$ be a random vector with values in the measurable space
$(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. A random variable $Z : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is $\sigma(X)$-measurable
if and only if there exists a measurable function $g : (\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such
that $Z = g(X)$.


Proof. The “if” part is just the stability of measurability under composition
(Theorem 4.1.11). For the necessity, first observe that this is true of simple random
variables. It therefore remains to show that it is true for a non-negative random
variable $Z$ (from which the general case straightforwardly follows). Such
a random variable is the limit of a non-decreasing sequence $\{Z_n\}_{n \ge 1}$ of non-negative
simple random variables of the form $g_n(X)$ for some measurable function
$g_n : (\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $M$ be the (measurable) set on which the
sequence $\{g_n\}_{n \ge 1}$ admits a limit. Define $g(x) = \lim_{n \uparrow \infty} g_n(x) 1_M(x)$ (a measurable
function). For each $\omega$, $Z(\omega) = \lim_{n \uparrow \infty} g_n(X(\omega))$, which implies that $X(\omega) \in M$ and
that $Z(\omega) = \lim_{n \uparrow \infty} g_n(X(\omega)) = g(X(\omega))$.


Properties of the Conditional Expectation 
The main rules that are useful in computing conditional expectations will be given 
once more, but this time in the general abstract framework. 


Let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$, and let $Y$, $Y_1$, $Y_2$ be integrable (resp. non-negative
finite) random variables, $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp. $\in \mathbb{R}_+$).


Theorem 5.6.4 Rule 1. (linearity)

\[ E^{\mathcal{G}}[\lambda_1 Y_1 + \lambda_2 Y_2] = \lambda_1 E^{\mathcal{G}}[Y_1] + \lambda_2 E^{\mathcal{G}}[Y_2]. \]

Proof. We consider the integrable case. We must check that $\lambda_1 E^{\mathcal{G}}[Y_1] + \lambda_2 E^{\mathcal{G}}[Y_2]$
is $\mathcal{G}$-measurable (which is part of the definition of a conditional expectation with
respect to $\mathcal{G}$) and that for all bounded $\mathcal{G}$-measurable random variables $U$

\[ E[(\lambda_1 E^{\mathcal{G}}[Y_1] + \lambda_2 E^{\mathcal{G}}[Y_2]) U] = E[(\lambda_1 Y_1 + \lambda_2 Y_2) U]. \]
This follows immediately from the definition of $E^{\mathcal{G}}[Y_i]$, which says that
$E[E^{\mathcal{G}}[Y_i] U] = E[Y_i U]$ ($i = 1, 2$).


Theorem 5.6.5 Rule 2. If $Y$ is independent of $\mathcal{G}$, then

\[ E^{\mathcal{G}}[Y] = E[Y]. \]

Proof. We consider the integrable case. First recall that the constant $E[Y]$ is
$\mathcal{G}$-measurable. It remains to prove that for all bounded $\mathcal{G}$-measurable random
variables $U$, $E[E[Y] U] = E[YU]$. This is the case since $Y$ and $U$ are independent
and therefore $E[YU] = E[Y] E[U]$.


Theorem 5.6.6 Rule 3. If $Y$ is $\mathcal{G}$-measurable,

\[ E^{\mathcal{G}}[Y] = Y. \]

Proof. We consider the integrable case. We must check that $Y$ is $\mathcal{G}$-measurable
and that $E[YU] = E[YU]$.


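Theorem 5.6.7 Rule 4. If $Y_1 \le Y_2$ P-a.s., then
\[ E^{\mathcal{G}}[Y_1] \le E^{\mathcal{G}}[Y_2] \quad P\text{-a.s.} \qquad (5.10) \]

In particular, if $Y$ is a non-negative random variable, then $E^{\mathcal{G}}[Y] \ge 0$, P-a.s.


Proof. We consider the integrable case. The non-negative case follows mutatis
mutandis. For any bounded $\mathcal{G}$-measurable random variable $U \ge 0$,

\[ E[E^{\mathcal{G}}[Y_1] U] = E[Y_1 U] \le E[Y_2 U] = E[E^{\mathcal{G}}[Y_2] U]. \]

Therefore
\[ E[(E^{\mathcal{G}}[Y_2] - E^{\mathcal{G}}[Y_1]) U] \ge 0. \]

Taking $U = 1_{\{E^{\mathcal{G}}[Y_2] < E^{\mathcal{G}}[Y_1]\}}$, we obtain (5.10).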


Theorem 5.6.8 Rule 5. (successive conditioning). Let $\mathcal{H}$ be a sub-$\sigma$-field of $\mathcal{F}$
such that $\mathcal{H} \subseteq \mathcal{G}$. Then
\[ E^{\mathcal{H}}[E^{\mathcal{G}}[Y]] = E^{\mathcal{H}}[Y]. \]

Proof. We just have to check that $E^{\mathcal{H}}[E^{\mathcal{G}}[Y]]$ is a version of $E^{\mathcal{H}}[Y]$. Since it is
an $\mathcal{H}$-measurable variable, it remains to show that it satisfies the equality
\[ E[E^{\mathcal{H}}[E^{\mathcal{G}}[Y]] U] = E[YU], \]
for any bounded (resp. bounded non-negative) $\mathcal{H}$-measurable variable $U$. Since
such a variable is a fortiori $\mathcal{G}$-measurable,

\[ E[E^{\mathcal{G}}[Y] U] = E[YU]. \]
Moreover,

\[ E[E^{\mathcal{H}}[E^{\mathcal{G}}[Y]] U] = E[E^{\mathcal{G}}[Y] U], \]
by definition of $E^{\mathcal{H}}[E^{\mathcal{G}}[Y]]$.


Theorem 5.6.9 Let $Y$ be of the form $Y = VZ$, where $V$ is a $\mathcal{G}$-measurable
bounded (resp. non-negative finite) random variable, and $Z$ is an integrable (resp.
non-negative finite) random variable. Then

\[ E^{\mathcal{G}}[VZ] = V E^{\mathcal{G}}[Z]. \]

Proof. We consider the integrable case. We observe that $V E^{\mathcal{G}}[Z]$ is $\mathcal{G}$-measurable,
and it remains to prove that for all bounded $\mathcal{G}$-measurable random variables $U$,

\[ E[VZU] = E[V E^{\mathcal{G}}[Z] U]. \]
But, since $VU$ is bounded, by definition of $E^{\mathcal{G}}[Z]$,

\[ E[V E^{\mathcal{G}}[Z] U] = E[VZU]. \]


The theorems allowing the interchange of limit and integral (monotone convergence
theorem and dominated convergence theorem) have conditional versions.


We start with the monotone convergence theorem: 


Theorem 5.6.10 Let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$, and let $\{Y_n\}_{n \ge 1}$ be a P-a.s.
non-decreasing sequence of non-negative random variables converging P-a.s. to the
random variable $Y$. Then $\{E^{\mathcal{G}}[Y_n]\}_{n \ge 1}$ is a P-a.s. non-decreasing sequence of random
variables converging P-a.s. to $E^{\mathcal{G}}[Y]$.


Proof. By monotonicity of conditional expectation, $\{E^{\mathcal{G}}[Y_n]\}_{n \ge 1}$ is a P-a.s.
non-decreasing sequence of $\mathcal{G}$-measurable random variables. In particular, there exists
a P-a.s. limit $W$, $\mathcal{G}$-measurable, of this sequence. By monotone convergence, for
any bounded non-negative $\mathcal{G}$-measurable random variable $U$,

\[ \lim_{n \uparrow \infty} E[Y_n U] = E[YU], \]

and

\[ \lim_{n \uparrow \infty} E[E^{\mathcal{G}}[Y_n] U] = E[WU]. \]

Therefore, since $E[Y_n U] = E[E^{\mathcal{G}}[Y_n] U]$ for all $n \ge 1$, $E[YU] = E[WU]$. This being
true for all bounded non-negative $\mathcal{G}$-measurable random variables $U$, $W = E[Y \mid \mathcal{G}]$.
We now turn to the conditioned version of the dominated convergence theorem:


Theorem 5.6.11 Let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$, and let $\{Y_n\}_{n \ge 1}$ be a sequence of
random variables converging P-a.s. to the random variable $Y$, and such that $|Y_n| \le Z$
for some integrable random variable $Z$. Then $\{E^{\mathcal{G}}[Y_n]\}_{n \ge 1}$ converges P-a.s. to
$E^{\mathcal{G}}[Y]$.


Proof. Let $W_n := \sup_{m \ge n} |Y_m - Y|$. The sequence $\{W_n\}_{n \ge 1}$ decreases P-a.s. to
0. We have

\begin{align*}
|E^{\mathcal{G}}[Y_n] - E^{\mathcal{G}}[Y]| &= |E^{\mathcal{G}}[Y_n - Y]| \\
&\le E^{\mathcal{G}}[|Y_n - Y|] \le E^{\mathcal{G}}[W_n].
\end{align*}

The non-negative sequence $\{E^{\mathcal{G}}[W_n]\}_{n \ge 1}$ decreases P-a.s. (rule 4). Let $H \ge 0$ be
its limit. Then
\[ 0 \le E[H] \le E[E^{\mathcal{G}}[W_n]] = E[W_n], \]

where the latter quantity tends to 0 by dominated convergence (because $0 \le W_n \le 2Z$).
Therefore $E[H] = 0$, which implies that $P(H = 0) = 1$ since $H$ is P-a.s.
non-negative.


The $L^2$-theory of Conditional Expectation


This paragraph gives another approach to conditional expectation that avoids the
use of Radon-Nikodym's theorem (that was admitted in this book).


Conditional expectation will be first defined for square-integrable random variables
in terms of projection from a Hilbert space onto a Hilbert subspace of the
latter (see footnote 2). More precisely, let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\mathcal{G}$ be a
sub-$\sigma$-field of $\mathcal{F}$. Denote by $L^2_{\mathbb{R}}(\mathcal{F}, P)$ and $L^2_{\mathbb{R}}(\mathcal{G}, P)$ the Hilbert spaces of $\mathcal{F}$-measurable
(resp. $\mathcal{G}$-measurable) square-integrable real random variables. Clearly, $L^2_{\mathbb{R}}(\mathcal{G}, P)$ is
a Hilbert subspace of $L^2_{\mathbb{R}}(\mathcal{F}, P)$, and therefore, one can define the projection of an
$\mathcal{F}$-measurable square-integrable variable $X$ on $L^2_{\mathbb{R}}(\mathcal{G}, P)$, denoted by $P^{\mathcal{G}}(X)$. From
the general theory of projection (see Theorems A.0.12 and A.0.11), this random
variable is the unique (in the $L^2$-sense) variable $Y$ such that

(i) $Y \in L^2_{\mathbb{R}}(\mathcal{G}, P)$, and
(ii) $\langle U, Y \rangle_{L^2_{\mathbb{R}}(\mathcal{F}, P)} = \langle U, X \rangle_{L^2_{\mathbb{R}}(\mathcal{F}, P)}$ for all $U \in L^2_{\mathbb{R}}(\mathcal{G}, P)$.

In other terms, $P^{\mathcal{G}}(X)$ is the unique (in the $L^2$-sense) square-integrable random
variable $Y$ such that

(a) $Y$ is $\mathcal{G}$-measurable, and

(b) $E[UY] = E[UX]$ for all square-integrable $\mathcal{G}$-measurable variables $U$.

This shows that $P^{\mathcal{G}}(X) = E[X \mid \mathcal{G}]$.


2 See Appendix A for a review of Hilbert spaces.


Starting from there, a proof of Theorem 5.6.2 is easy and left as an exercise. 


Nonlinear Regression 


We have previously obtained in Section 3.3 the best linear least-squares estimator
of the square-integrable random variable $Y$ in terms of the second-order random
vector $X = (X_1, \ldots, X_N)$. We shall now obtain the best non-linear least-squares
estimator of $Y$ in terms of $X$, that is to say the square-integrable random variable
of the form $g(X)$, where $g : \mathbb{R}^N \to \mathbb{R}$ is measurable, which minimizes the
quadratic risk $E[|Y - g(X)|^2]$. Whereas the best nonlinear estimator is chosen
among all square-integrable variables $g(X)$, $g : \mathbb{R}^N \to \mathbb{R}$ measurable, the best
linear estimator is chosen among the variables $g(X)$ where $g(x) = a_0 + \sum_{i=1}^{N} a_i x_i$.
In particular, if $\hat{g}(X)$ is the best nonlinear estimator, and $\hat{Y}$ is the best linear
estimator, $E[|Y - \hat{g}(X)|^2] \le E[|Y - \hat{Y}|^2]$. It is therefore theoretically advantageous to
use a nonlinear estimator. However, as we have seen, the construction of $\hat{Y}$ only
requires the knowledge of the covariance structure of the vector $(Y, X)$, whereas
the construction of $\hat{g}(X)$ requires the knowledge of the joint distribution of $(Y, X)$.


As we shall now see, the best nonlinear estimator is
\[ \hat{g}(X) = E^X[Y]. \]

Since $Y$ is square-integrable, and in particular integrable, the conditional
expectation of $Y$ given $X$ is well defined, and it is square-integrable.


Theorem 5.6.12 Let $X$ be a random vector and let $Y$ be a square-integrable
random variable. Then, for all measurable $g : \mathbb{R}^N \to \mathbb{R}$ such that $g(X)$ is
square-integrable,

\[ E[|Y - E^X[Y]|^2] \le E[|Y - g(X)|^2]. \]

Proof. Developing both sides of this inequality, we have to show that
\[ E[E^X[Y]^2] - 2E[Y E^X[Y]] \le E[g(X)^2] - 2E[Y g(X)]. \]
Since $E^X[Y]$ is a square-integrable function of $X$,

\[ E[E^X[Y]^2] = E[E^X[Y] E^X[Y]] = E[Y E^X[Y]]. \]
The left-hand side of the last inequality therefore equals $-E[E^X[Y]^2]$. But
$E[Y g(X)] = E[E^X[Y] g(X)]$, and we therefore have to show that

\[ -E[E^X[Y]^2] \le E[g(X)^2] - 2E[E^X[Y] g(X)]. \]

But this is just
\[ E[(g(X) - E^X[Y])^2] \ge 0. \]
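The inequality of Theorem 5.6.12 can be observed on a toy example where everything is computable exactly. The sketch below is an illustration under assumed data (a hypothetical finite joint law in which $Y$ equals $X^2$ plus a symmetric noise; the helper names are ours). It compares the quadratic risk of the conditional mean with that of the best linear estimator $a_0 + a_1 X$ built from the covariance structure.

```python
from fractions import Fraction

# Hypothetical joint law of (X, Y) on a finite set: Y = X^2 plus a symmetric noise.
joint = {}
for x in (-1, 0, 1, 2):
    for noise in (-1, 1):
        joint[(x, x * x + noise)] = Fraction(1, 8)


def expect(f):
    """E[f(X, Y)] under the joint law."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


def cond_mean(x0):
    """E[Y | X = x0] = E[Y 1_{X=x0}] / P(X=x0), for x0 in the support of X."""
    p_x = expect(lambda x, y: 1 if x == x0 else 0)
    return expect(lambda x, y: y if x == x0 else 0) / p_x


def risk(g):
    """Quadratic risk E[(Y - g(X))^2]."""
    return expect(lambda x, y: (y - g(x)) ** 2)


# Best linear estimator a0 + a1 * X, using only means and covariances.
m_x, m_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - m_x) ** 2)
cov_xy = expect(lambda x, y: (x - m_x) * (y - m_y))
a1 = cov_xy / var_x
a0 = m_y - a1 * m_x

risk_nonlinear = risk(cond_mean)
risk_linear = risk(lambda x: a0 + a1 * x)
```

In this assumed model the conditional mean recovers $x \mapsto x^2$, so its risk is exactly the noise variance, while the linear estimator pays an additional price for the curvature it cannot represent.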


If (Y, X) is jointly Gaussian, then the best linear estimator and the best non- 
linear estimator, of Y given X, coincide. 


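Theorem 5.6.13 Let $Y$ be a random variable and let $X$ be an $n$-dimensional random
vector. Suppose that $(Y, X)$ is jointly Gaussian, and that $X$ is non-degenerate
(its covariance matrix is strictly positive). Then

\[ E^X[Y] = m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X). \]

Proof. Consider the random variable
\[ U = Y - \left( m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X) \right). \]
We have $E[U] = 0$ and

\begin{align*}
E[U (X - m_X)^T] &= E[(Y - m_Y)(X - m_X)^T] - \Gamma_{YX} \Gamma_X^{-1} E[(X - m_X)(X - m_X)^T] \\
&= \Gamma_{YX} - \Gamma_{YX} \Gamma_X^{-1} \Gamma_X = \Gamma_{YX} - \Gamma_{YX} = 0.
\end{align*}

Therefore $U$ and $X$ are uncorrelated. Since $(U, X)$ is jointly Gaussian, this implies
that $U$ and $X$ are independent. In particular (Theorem 5.5.9),

\[ E^X[U] = E[U] = 0. \]

Also, by linearity,
\[ E^X[U] = E^X[Y] - E^X[m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X)]. \]
By Theorem 5.5.10,

\[ E^X[m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X)] = m_Y + \Gamma_{YX} \Gamma_X^{-1} (X - m_X). \]

Combining the last three displays gives the announced formula.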
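The only computation in the Gaussian formula is the row vector $\Gamma_{YX} \Gamma_X^{-1}$. The following sketch works it out for a two-dimensional $X$ in exact arithmetic and checks that the residual $U$ of the proof is uncorrelated with $X$. All numerical values (`m_y`, `m_x`, `gamma_x`, `gamma_yx`) are hypothetical second-order data chosen for illustration, and the small matrix helpers are ours.

```python
from fractions import Fraction

# Hypothetical second-order data for (Y, X), with X two-dimensional.
m_y = Fraction(1)
m_x = [Fraction(0), Fraction(2)]
gamma_x = [[Fraction(2), Fraction(1)],
           [Fraction(1), Fraction(3)]]      # covariance matrix of X
gamma_yx = [Fraction(1), Fraction(-1)]      # cov(Y, X_1), cov(Y, X_2)


def inverse_2x2(m):
    """Inverse of a 2x2 matrix with non-zero determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def row_times_matrix(row, m):
    """Product of a row vector with a matrix."""
    return [sum(row[i] * m[i][j] for i in range(len(row))) for j in range(len(m[0]))]


# Regression coefficients Gamma_{YX} Gamma_X^{-1}.
coeffs = row_times_matrix(gamma_yx, inverse_2x2(gamma_x))


def conditional_mean(x):
    """m_Y + Gamma_{YX} Gamma_X^{-1} (x - m_X)."""
    return m_y + sum(c * (xi - mi) for c, xi, mi in zip(coeffs, x, m_x))


# Covariance between U = Y - conditional_mean(X) and X:
# Gamma_{YX} - Gamma_{YX} Gamma_X^{-1} Gamma_X, which should vanish.
residual_cov = [g - c for g, c in zip(gamma_yx, row_times_matrix(coeffs, gamma_x))]
```

Note that the same coefficients give the best linear estimator for any second-order vector; it is only under the joint Gaussian assumption that this linear function is also the conditional expectation.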
5.7 Exercises


Exercise 5.7.1. $P(f(X) = 0) = 0$
Let $X$ be a random vector of $\mathbb{R}^d$ admitting the probability density function $f$.
Show that $P(f(X) = 0) = 0$.


Exercise 5.7.2. EXTENSION OF THE TELESCOPE FORMULA
Let $X$ be a non-negative random variable and let $G : \mathbb{R}_+ \to \mathbb{C}$ be the primitive
function of $g : \mathbb{R}_+ \to \mathbb{C}$, that is, for all $x \ge 0$,

\[ G(x) = G(0) + \int_0^x g(u) \, du. \]

Suppose that $X$ has finite mean $\mu$ and that $E[|G(X)|] < \infty$. Show that

\[ E[G(X)] = G(0) + \int_0^\infty g(x) P(X > x) \, dx. \]


Exercise 5.7.3. A FORMULA FOR MOMENTS
Let $X$ be a non-negative random variable with the probability density function $f$.
Let $r > 0$ be such that $E[|X|^r] < \infty$. Prove that

\[ E[X^r] = \int_0^\infty r x^{r-1} P(X > x) \, dx. \]
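Before attempting the proof, it may help to see the moment formula at work. The sketch below is only a numerical companion: it uses a hypothetical discrete law (the formula, which follows from Exercise 5.7.2, does not depend on the existence of a density, and a discrete law allows exact arithmetic) and integer exponents $r \ge 1$. Since $P(X > x)$ is constant between consecutive support points, the integral reduces to a finite sum.

```python
from fractions import Fraction

# Hypothetical non-negative random variable with finite support.
law = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}


def moment(r):
    """E[X^r] computed directly from the law (r a positive integer)."""
    return sum(Fraction(v) ** r * p for v, p in law.items())


def tail(x):
    """P(X > x)."""
    return sum((p for v, p in law.items() if v > x), Fraction(0))


def moment_from_tail(r):
    """Integral over [0, inf) of r x^(r-1) P(X > x) dx.

    P(X > x) equals tail(a) on each interval [a, b) between consecutive
    support points, and the integral of r x^(r-1) over [a, b) is b^r - a^r.
    """
    points = sorted(set(law) | {0})
    total = Fraction(0)
    for a, b in zip(points, points[1:]):
        total += (Fraction(b) ** r - Fraction(a) ** r) * tail(a)
    return total
```

For instance, `moment(2)` and `moment_from_tail(2)` can be compared directly with `==`.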


Exercise 5.7.4. INFINITE SUMS AND EXPECTATIONS 


In the first chapters, we have sometimes surreptitiously taken for granted that the 
expectation of an infinite sum of random variables is equal to the sum of their 
expectations. The result (to be proved) that justifies this, when it is true, is the 
following: 

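(a) Let $\{S_n\}_{n \ge 1}$ be a sequence of non-negative random variables. Then:

\[ E\left[ \sum_{n=1}^{\infty} S_n \right] = \sum_{n=1}^{\infty} E[S_n]. \]

(b) Let $\{S_n\}_{n \ge 1}$ be a sequence of real random variables such that
$\sum_{n \ge 1} E[|S_n|] < \infty$. Then:

\[ E\left[ \sum_{n=1}^{\infty} S_n \right] = \sum_{n=1}^{\infty} E[S_n]. \]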
Exercise 5.7.5. LAPLACE TRANSFORM
Let $X$ be a non-negative random variable. Prove that

\[ \lim_{\theta \uparrow \infty} E[e^{-\theta X}] = P(X = 0). \]


Exercise 5.7.6. LADDER RANDOM VARIABLES
A real random variable $X$ is called a ladder random variable if there exist $a$ and $h$
in $\mathbb{R}$ such that

\[ \sum_{n \in \mathbb{Z}} P(X = a + nh) = 1. \]

Let $\varphi$ be the characteristic function of a real random variable $X$. Prove that if
$|\varphi(t_0)| = 1$ for some $t_0 \in \mathbb{R}$, $t_0 \ne 0$, then $X$ is a ladder random variable.


Exercise 5.7.7. CHARACTERISTIC FUNCTIONS AND INDEPENDENCE 
Prove Theorem 5.4.6. 


Exercise 5.7.8. RADON-NIKODYM 

Let $\{P_n\}_{n \ge 1}$ be a sequence of probability measures on $(\Omega, \mathcal{F})$. Show the existence
of a probability measure $P$ on $(\Omega, \mathcal{F})$ and of a sequence $\{f_n\}_{n \ge 1}$ of $P$-integrable
non-negative measurable functions such that for all $A \in \mathcal{F}$ and all $n \ge 1$,

\[ P_n(A) = \int_A f_n(\omega) \, P(d\omega). \]


Exercise 5.7.9. CONDITIONAL INDEPENDENCE 
Let $A$ be some event of positive probability, and let $P_A$ denote the probability $P$
conditioned by $A$, that is,

\[ P_A(\cdot) = P(\cdot \mid A). \]
The random variables $X$ and $Y$ are said to be conditionally independent given $A$
if they are independent with respect to probability $P_A$. Prove that this is the case
if and only if for all $u, v \in \mathbb{R}$,

\[ P(A) \, E[e^{iuX} e^{ivY} 1_A] = E[e^{iuX} 1_A] \, E[e^{ivY} 1_A]. \]


Exercise 5.7.10. $E[X] - E[Y]$
Let $X$ and $Y$ be real integrable random variables. Prove the following:

\[ E[X] - E[Y] = \int_{\mathbb{R}} \left( P(Y < t \le X) - P(X < t \le Y) \right) dt. \]
Exercise 5.7.11. MOMENTS OF GAUSSIAN VECTORS
A. Give the proof of Theorem 5.3.5.


B. Let $X = (X_1, \ldots, X_n)^T$ be a centered (0-mean) $n$-dimensional Gaussian vector
with the covariance matrix $\Gamma = \{\sigma_{ij}\}$. Show that
\[ E[X_{i_1} X_{i_2} \cdots X_{i_{2k}}] = \sum_{\substack{(j_1, \ldots, j_{2k}) \\ j_1 < j_2, \ldots, j_{2k-1} < j_{2k}}} \sigma_{j_1 j_2} \sigma_{j_3 j_4} \cdots \sigma_{j_{2k-1} j_{2k}}, \qquad (5.11) \]

where the summation extends over all permutations $(j_1, \ldots, j_{2k})$ of $\{i_1, \ldots, i_{2k}\}$
such that $j_1 < j_2, \ldots, j_{2k-1} < j_{2k}$. There are $1 \cdot 3 \cdot 5 \cdots (2k - 1)$ terms in the
right-hand side of Eq. (5.11). The indices $i_1, \ldots, i_{2k}$ are in $\{1, \ldots, n\}$ and they
may occur with repetitions. Show that the odd moments of $X$ are null, that is:

\[ E[X_{i_1} \cdots X_{i_{2k+1}}] = 0, \]

for all $(i_1, \ldots, i_{2k+1}) \in \{1, 2, \ldots, n\}^{2k+1}$.


Exercise 5.7.12. CONDITIONING BY THE SQUARE. 
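Let $X$ be a real random variable with probability density $f_X$. Let $h : \mathbb{R} \to \mathbb{R}$ be
a function such that $h(X)$ is integrable. We prove that

\[ E[h(X) \mid X^2] = h(\sqrt{X^2}) \frac{f_X(\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})} + h(-\sqrt{X^2}) \frac{f_X(-\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})}. \]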


(Some people may find this result intuitive. Others will need a formal proof, which 
is given below.) 


Exercise 5.7.13. MIXED CASE: A SPECIFIC EXAMPLE 
Let $Y$ be an $\mathbb{N}_+$-valued random variable, and let $X$ be a real random variable of
the form

\[ X = Y + \varepsilon, \]

where $\varepsilon$ is a random variable admitting a probability density $f_\varepsilon$ and independent
of $Y$. Let $h : \mathbb{R} \to \mathbb{R}$ be a measurable function such that $E[|h(X)|] < \infty$. Give
the function $\psi$ such that $\psi(Y) = E^Y[h(X)]$.


Exercise 5.7.14. CONDITIONING BY THE SUM 
Let $X_1$ and $X_2$ be two integrable independent identically distributed random
variables. Show that

\[ E^{X_1 + X_2}[X_1] = \frac{X_1 + X_2}{2}. \]
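A quick sanity check of the claimed identity, before proving it, can be run on a discrete example. The sketch below assumes a hypothetical IID pair (two fair dice) and computes $E[X_1 \mid X_1 + X_2 = s]$ by direct averaging; the function name is ours.

```python
from fractions import Fraction
from itertools import product

# Hypothetical IID pair: two fair six-sided dice, all 36 outcomes equally likely.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)


def cond_mean_first_given_sum(s):
    """E[X1 | X1 + X2 = s], by averaging X1 over the event {X1 + X2 = s}."""
    matching = [(x1, x2) for x1, x2 in outcomes if x1 + x2 == s]
    mass = p * len(matching)
    return sum(x1 * p for x1, _ in matching) / mass


checks = {s: cond_mean_first_given_sum(s) for s in range(2, 13)}
```

Each entry `checks[s]` can then be compared with `Fraction(s, 2)`.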


Exercise 5.7.15. MIN CONDITIONED BY MAX 




Let $X_1$ and $X_2$ be two independent random variables uniformly distributed on the
interval $[0,1]$. Compute $E^{\max(X_1, X_2)}[\min(X_1, X_2)]$.


Exercise 5.7.16. EXPONENTIAL DISTRIBUTIONS 

Let $\{S_n\}_{n \ge 1}$ be a sequence of IID non-negative real random variables with
exponential distribution of parameter $\lambda$. Define $T_1 = S_1$, $T_2 = T_1 + S_2$, ...,
$T_{n+1} = T_n + S_{n+1}$, etc. Let $f : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-negative function. Give
for m > 2 the conditional distribution of (T1,...,Zin) given Tj, = s. Show that it 
is the same as the distribution of the vector obtained by reordering (sUi,...,5Um), 
where (U;,...,Um) is a vector of m independent variables uniformly distributed 
on (0, 1]). Compute E [ew (7), 


Exercise 5.7.17. WILL THE SUN RISE NEXT DAY? 

At the beginning of time, God chose (a probabilist once claimed) a number p at 
random in the interval [0, 1] and devised a biased coin with probability p for heads. 
Since then, He tosses the same coin once every morning and decides to let the sun 
rise this day if and only if the result is heads. The common belief, which will be 
taken to be true in this exercise, is that the sun has never failed to rise in the n 
days separating us from the beginning of time. What then is the probability that 
the sun will rise the next day (n + 1th)? 


Exercise 5.7.18. 

Let $X$ and $Y$ be two real random variables, and let $h : \mathbb{R} \to \mathbb{R}$ be one-to-one
and onto. Show that for all $v : \mathbb{R} \to \mathbb{R}$ such that $E[|v(X)|] < \infty$, $E^Y[v(X)] =
E^Z[v(X)]$, where $Z = h(Y)$.


Exercise 5.7.19. THE CONDITIONAL VARIANCE FORMULA 
Prove the following formula
\[ \mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y]), \]
where $X$ is a square-integrable random variable, and
\[ \mathrm{Var}(X \mid Y) := E[(X - E[X \mid Y])^2 \mid Y] \]
is the so-called conditional variance of $X$ given $Y$.
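The decomposition can be checked numerically on a small discrete model before it is proved in general. The sketch below is an illustration under assumed data: the joint law `joint` is hypothetical, and the helper names are ours. Conditional means and variances are computed by averaging over the events $\{Y = y\}$.

```python
from fractions import Fraction

# Hypothetical joint law of (X, Y): keys are (x, y), values are probabilities.
joint = {
    (0, 0): Fraction(1, 4), (2, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 6), (4, 1): Fraction(1, 3),
}


def expect(f):
    """E[f(X, Y)] under the joint law."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


def cond_expect(f, y0):
    """E[f(X) | Y = y0], for y0 in the support of Y."""
    p_y = expect(lambda x, y: 1 if y == y0 else 0)
    return expect(lambda x, y: f(x) if y == y0 else 0) / p_y


def cond_mean(y0):
    return cond_expect(lambda x: x, y0)


def cond_var(y0):
    m = cond_mean(y0)
    return cond_expect(lambda x: (x - m) ** 2, y0)


mean_x = expect(lambda x, y: x)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
mean_cond_var = expect(lambda x, y: cond_var(y))                    # E[Var(X | Y)]
var_cond_mean = expect(lambda x, y: (cond_mean(y) - mean_x) ** 2)   # Var(E[X | Y])
```

The last line uses the fact that $E[E[X \mid Y]] = E[X]$, so that the variance of the conditional mean is centred at `mean_x`.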


Exercise 5.7.20. CONDITIONAL JENSEN’S INEQUALITY 

Let $I$ be a general interval of $\mathbb{R}$ (closed, open, semi-closed, infinite, etc.) and let
$(a, b)$ be its interior, assumed non-empty. Let $\varphi : I \to \mathbb{R}$ be a convex function. Let
$X$ be an integrable real-valued random variable such that $P(X \in I) = 1$. Assume
moreover that either $\varphi$ is non-negative, or that $\varphi(X)$ is integrable. Prove that for
any sub-$\sigma$-field $\mathcal{G} \subseteq \mathcal{F}$

\[ E[\varphi(X) \mid \mathcal{G}] \ge \varphi(E[X \mid \mathcal{G}]). \]


5.8 Solutions 


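SOLUTION (Exercise 5.7.1).

\[ P(f(X) = 0) = E[1_{\{f(X) = 0\}}] = \int_{\mathbb{R}^d} 1_{\{f(x) = 0\}} f(x) \, dx = \int_{\mathbb{R}^d} 0 \, dx = 0. \]

SOLUTION (Exercise 5.7.2).

\begin{align*}
E[G(X)] &= G(0) + E\left[ \int_0^X g(u) \, du \right] = G(0) + E\left[ \int_0^\infty 1_{\{X > u\}} g(u) \, du \right] \\
&= G(0) + \int_0^\infty g(u) E[1_{\{X > u\}}] \, du = G(0) + \int_0^\infty g(u) P(X > u) \, du,
\end{align*}

where the third equality is due to Tonelli's theorem applied to the product measure
$P \times \ell$, where $\ell$ is the Lebesgue measure on $\mathbb{R}_+$.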


SOLUTION (Exercise 5.7.3). 
A direct consequence of Exercise 5.7.2 with $G(x) = x^r$.


SOLUTION (Exercise 5.7.4). 


(a) Apply the monotone convergence theorem (Theorem 5.1.2) with $X_n = \sum_{k=1}^{n} S_k$
and $X = \sum_{n \ge 1} S_n$.


(b) Apply the dominated convergence theorem (Theorem 5.1.3) with $X_n = \sum_{k=1}^{n} S_k$,
$X = \sum_{n \ge 1} S_n$ and $Z = \sum_{k \ge 1} |S_k|$. (By (a), $E[Z] = \sum_{k \ge 1} E[|S_k|] < \infty$.)


SOLUTION (Exercise 5.7.5). 


\[ E[e^{-\theta X}] = E[1_{\{X > 0\}} e^{-\theta X}] + E[1_{\{X = 0\}}]. \]
But since $\lim_{\theta \uparrow \infty} 1_{\{X > 0\}} e^{-\theta X} = 0$, $\lim_{\theta \uparrow \infty} E[1_{\{X > 0\}} e^{-\theta X}] = 0$ by dominated
convergence ($1_{\{X > 0\}} e^{-\theta X} \le 1$, an integrable random variable). Therefore the limit
in question is $E[1_{\{X = 0\}}] = P(X = 0)$.


SOLUTION (Exercise 5.7.6). 
The hypothesis implies that there exists an $a \in \mathbb{R}$ such that $e^{i t_0 a} = E[e^{i t_0 X}]$,
that is, $E[e^{i t_0 (X - a)}] = 1$. In particular (considering the real parts),

\[ 1 - E[\cos(t_0(X - a))] = E[1 - \cos(t_0(X - a))] = 0. \]
Since $1 - \cos(t_0(X - a)) \ge 0$, this implies that, P-a.s., $1 = \cos(t_0(X - a))$, that is,
P-a.s., $X - a$ is an integer multiple of $h := 2\pi / t_0$, which is the announced result.


SOLUTION (Exercise 5.7.7). 
Necessity. Write

\[ \varphi_X(u) = E\left[ e^{i \sum_{j=1}^{d} u_j X_j} \right] = E\left[ \prod_{j=1}^{d} e^{i u_j X_j} \right] = \prod_{j=1}^{d} E\left[ e^{i u_j X_j} \right] = \prod_{j=1}^{d} \varphi_{X_j}(u_j), \]

by the product formula for expectations.


Sufficiency. Let $X' := (X'_1, \ldots, X'_d)$ be a random vector whose independent
coordinates $X'_1, \ldots, X'_d$ have the respective characteristic functions
$\varphi_1, \ldots, \varphi_d$. The characteristic function of $X'$ is $\prod_{j=1}^{d} \varphi_j(u_j)$ and therefore
$X$ and $X'$ have the same distribution. In particular, $X_1, \ldots, X_d$ are independent
with respective characteristic functions $\varphi_1, \ldots, \varphi_d$.


SOLUTION (Exercise 5.7.8). 
Take $P := \sum_{n \ge 1} 2^{-n} P_n$, check that it is a probability measure and that $P_n \ll P$ ($n \ge 1$). Then apply the Radon-Nikodym theorem.


SOLUTION (Exercise 5.7.9). 
By Theorem 3.2.20, a necessary and sufficient condition for this is that for all $u, v \in \mathbb{R}$,

$$E_A\left[e^{iuX + ivY}\right] = E_A\left[e^{iuX}\right] E_A\left[e^{ivY}\right],$$

where $E_A$ denotes expectation with respect to $P_A$. Then, observe that for an integrable or non-negative random variable $Z$,

$$P(A)\, E_A[Z] = E[Z 1_A].$$

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SOLUTION (Exercise 5.7.10). 
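Observe that, by Fubini,

$$\int_{\mathbb{R}} P(X < t < Y)\,dt = E\left[\int_{\mathbb{R}} 1_{\{X<t<Y\}}\,dt\right]$$

and

$$\int_{\mathbb{R}} P(Y < t < X)\,dt = E\left[\int_{\mathbb{R}} 1_{\{Y<t<X\}}\,dt\right].$$


SOLUTION (Exercise 5.7.11).
A. Apply Theorem 4.3.7.
B. Apply A with $\varphi$ the characteristic function of this Gaussian vector.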


SOLUTION (Exercise 5.7.12). 
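The right-hand side is a function of $X^2$. It remains to show that for all bounded measurable functions $\varphi$,

$$E\left[h(X)\varphi(X^2)\right] = E\left[\left(h(\sqrt{X^2})\,\frac{f_X(\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})} + h(-\sqrt{X^2})\,\frac{f_X(-\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})}\right)\varphi(X^2)\right].$$

It suffices to show that

$$E\left[h(X) 1_{\{X>0\}} \varphi(X^2)\right] = E\left[h(\sqrt{X^2})\,\frac{f_X(\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})}\,\varphi(X^2)\right]$$

and

$$E\left[h(X) 1_{\{X<0\}} \varphi(X^2)\right] = E\left[h(-\sqrt{X^2})\,\frac{f_X(-\sqrt{X^2})}{f_X(\sqrt{X^2}) + f_X(-\sqrt{X^2})}\,\varphi(X^2)\right].$$

We prove the first of these two equalities. Its right-hand side equals

$$\int_{\mathbb{R}} h(\sqrt{x^2})\,\frac{f_X(\sqrt{x^2})}{f_X(\sqrt{x^2}) + f_X(-\sqrt{x^2})}\,\varphi(x^2)\, f_X(x)\,dx,$$

or (splitting the domain of integration)

$$\int_0^\infty h(x)\,\frac{f_X(x)}{f_X(x) + f_X(-x)}\,\varphi(x^2)\, f_X(x)\,dx + \int_{-\infty}^0 h(-x)\,\frac{f_X(-x)}{f_X(-x) + f_X(x)}\,\varphi(x^2)\, f_X(x)\,dx,$$

that is, by the change of variable $x \mapsto -x$ in the second term,

$$\int_0^\infty h(x)\,\frac{f_X(x)}{f_X(x) + f_X(-x)}\,\varphi(x^2)\,\bigl(f_X(x) + f_X(-x)\bigr)\,dx = \int_0^\infty h(x)\varphi(x^2) f_X(x)\,dx = E\left[h(X) 1_{\{X>0\}} \varphi(X^2)\right].$$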


SOLUTION (Exercise 5.7.13). 
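$E^Y[h(X)] = \psi(Y)$, where

$$\psi(k) = \int_{\mathbb{R}} h(x) f_k(x)\,dx$$

and where $f_k$ is defined by

$$\int_A f_k(x)\,dx = P(X \in A \mid Y = k) \quad (A \in \mathcal{B}(\mathbb{R})),$$

that is,

$$\int_A f_k(x)\,dx = \frac{P(X \in A, Y = k)}{P(Y = k)} = \frac{P(k + \varepsilon \in A, Y = k)}{P(Y = k)} = \frac{P(k + \varepsilon \in A)\,P(Y = k)}{P(Y = k)}$$

$$= P(k + \varepsilon \in A) = P(\varepsilon \in A - k) = \int_A f_\varepsilon(x - k)\,dx.$$

Therefore

$$f_k(x) = f_\varepsilon(x - k)$$

and

$$\psi(k) = \int_{\mathbb{R}} h(x) f_\varepsilon(x - k)\,dx = \int_{\mathbb{R}} h(x + k) f_\varepsilon(x)\,dx,$$

that is,

$$\psi(k) = E[h(\varepsilon + k)].$$

Finally:

$$E^Y[h(X)] = \int_{\mathbb{R}} h(x + Y) f_\varepsilon(x)\,dx.$$
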
R 
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SOLUTION (Exercise 5.7.14). 
$E^{X_1+X_2}[X_1]$ is of the form $h(X_1 + X_2)$. By symmetry, $E^{X_1+X_2}[X_2] = h(X_1 + X_2)$. But

$$E^{X_1+X_2}[X_1] + E^{X_1+X_2}[X_2] = E^{X_1+X_2}[X_1 + X_2] = X_1 + X_2.$$

Therefore $2h(X_1 + X_2) = X_1 + X_2$.


SOLUTION (Exercise 5.7.15). 
We must find a measurable function $h : [0,1] \to \mathbb{R}$ such that for all bounded measurable functions $\varphi : [0,1] \to \mathbb{R}$,

$$E[\min(X_1, X_2)\,\varphi(\max(X_1, X_2))] = E[h(\max(X_1, X_2))\,\varphi(\max(X_1, X_2))],$$

in which case $E^{\max(X_1,X_2)}[\min(X_1, X_2)] = h(\max(X_1, X_2))$. Now

$$E[\min(X_1, X_2)\,\varphi(\max(X_1, X_2))] = \int_0^1\!\!\int_0^1 1_{\{x_1<x_2\}}\, x_1 \varphi(x_2)\,dx_1\,dx_2 + \int_0^1\!\!\int_0^1 1_{\{x_2<x_1\}}\, x_2 \varphi(x_1)\,dx_1\,dx_2$$

$$= 2\int_0^1 \left(\int_0^{x_2} x_1\,dx_1\right)\varphi(x_2)\,dx_2 = \int_0^1 x^2 \varphi(x)\,dx.$$

On the other hand, for $x \in [0,1]$,

$$P(\max(X_1, X_2) \le x) = P(X_1 \le x, X_2 \le x) = P(X_1 \le x)\,P(X_2 \le x) = x^2$$

and therefore, $\max(X_1, X_2)$ admits the probability density function $2x\,1_{[0,1]}(x)$ and

$$E[h(\max(X_1, X_2))\,\varphi(\max(X_1, X_2))] = \int_0^1 h(x)\varphi(x)\,2x\,dx,$$

so that $2x\,h(x) = x^2$ and finally $E^{\max(X_1,X_2)}[\min(X_1, X_2)] = \frac{1}{2}\max(X_1, X_2)$.


SOLUTION (Exercise 5.7.16). 
Let $S := (S_1, \dots, S_m)$ and $T := (T_1, \dots, T_m)$. The probability density function of $S$ is

$$f_S(s_1, \dots, s_m) = \lambda^m e^{-\lambda(s_1 + \cdots + s_m)}\, 1_{\{s_1>0, \dots, s_m>0\}}.$$

Since $S_1 = T_1$, $S_2 = T_2 - T_1$, ..., $S_m = T_m - T_{m-1}$, the formula of smooth change of variables gives

$$f_T(t_1, \dots, t_m) = f_S(t_1, t_2 - t_1, \dots, t_m - t_{m-1}) = \lambda^m e^{-\lambda t_m}\, 1_C(t_1, \dots, t_m),$$

where $C := \{(t_1, \dots, t_m);\ 0 < t_1 < \cdots < t_m\}$. The probability density function of $T_m$ is obtained by integrating out $t_1, \dots, t_{m-1}$ in $f_T(t_1, \dots, t_m)$, which gives $f_{T_m}(s) = (\lambda^m/(m-1)!)\, s^{m-1} e^{-\lambda s}$. Therefore

$$f_T(t_1, \dots, t_m \mid T_{m+1} = s) = \frac{f_{(T_1, \dots, T_{m+1})}(t_1, \dots, t_m, s)}{f_{T_{m+1}}(s)} = \frac{\lambda^{m+1} e^{-\lambda s}\, 1_C(t_1, \dots, t_m)\, 1_{\{t_m<s\}}}{(\lambda^{m+1}/m!)\, s^m e^{-\lambda s}} = \frac{m!}{s^m}\, 1_{\{t_1<t_2<\cdots<t_m<s\}}.$$

This is indeed the probability density function of the vector obtained by reordering $(sU_1, \dots, sU_m)$, where $(U_1, \dots, U_m)$ is a vector of $m$ independent variables uniformly distributed on $[0,1]$. In particular


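$$E\left[e^{-\sum_{n=1}^m f(T_n)}\right] = E\left[E\left[e^{-\sum_{n=1}^m f(T_n)} \,\Big|\, T_{m+1}\right]\right] = \int_0^\infty E\left[e^{-\sum_{n=1}^m f(T_n)} \,\Big|\, T_{m+1} = s\right] f_{T_{m+1}}(s)\,ds$$

$$= \int_0^\infty E\left[e^{-\sum_{n=1}^m f(sU_n)}\right] f_{T_{m+1}}(s)\,ds = \int_0^\infty \left(E\left[e^{-f(sU_1)}\right]\right)^m f_{T_{m+1}}(s)\,ds$$

$$= E\left[\left(\frac{1}{T_{m+1}}\int_0^{T_{m+1}} e^{-f(t)}\,dt\right)^m\right] = E\left[\left(1 - \frac{1}{T_{m+1}}\int_0^{T_{m+1}} \bigl(1 - e^{-f(t)}\bigr)\,dt\right)^m\right].$$

By the law of large numbers, $\lim_{m \uparrow \infty} T_{m+1}/m = 1/\lambda$. Therefore, passing to the limit as $m \uparrow \infty$ (dominated convergence),

$$E\left[e^{-\sum_{n \ge 1} f(T_n)}\right] = e^{-\lambda \int_0^\infty (1 - e^{-f(t)})\,dt}.$$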


SOLUTION (Exercise 5.7.17). 
Letting Z be the (random) bias of the coin, and X,, = 1 if the sun rises on day n, 
we have to compute 


$$P(X_{n+1} = 1 \mid X_1 = 1, \dots, X_n = 1) = \frac{P(X_1 = 1, \dots, X_n = 1, X_{n+1} = 1)}{P(X_1 = 1, \dots, X_n = 1)}.$$

But for all $k \ge 1$,

$$P(X_1 = 1, \dots, X_k = 1) = \int_0^1 P(X_1 = 1, \dots, X_k = 1 \mid Z = p)\,dp = \int_0^1 p^k\,dp = \frac{1}{k+1},$$

and therefore

$$P(X_{n+1} = 1 \mid X_1 = 1, \dots, X_n = 1) = \frac{n+1}{n+2}.$$
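The computation above can be replayed with exact rational arithmetic. The sketch below assumes, as the integral with respect to $dp$ suggests, that the bias $Z$ is uniform on $[0,1]$; the function names are illustrative and not taken from the text.

```python
from fractions import Fraction


def prob_first_k_ones(k: int) -> Fraction:
    """P(X_1 = 1, ..., X_k = 1) = integral of p**k over [0, 1] = 1 / (k + 1)."""
    return Fraction(1, k + 1)


def succession_probability(n: int) -> Fraction:
    """P(X_{n+1} = 1 | X_1 = ... = X_n = 1), as a ratio of two such integrals."""
    return prob_first_k_ones(n + 1) / prob_first_k_ones(n)


def midpoint_integral_of_power(k: int, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of p**k over [0, 1]."""
    h = 1.0 / steps
    return sum(((i + 0.5) * h) ** k for i in range(steps)) * h


for n in (1, 10, 100):
    print(n, succession_probability(n))
```

The midpoint rule is included only as a numerical cross-check of the closed form $1/(k+1)$.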


SOLUTION (Exercise 5.7.18). 
Both $E^Y[v(X)]$ and $E^Z[v(X)]$ are functions of $Y$ (since a function of $Z$ is a function of $Y$!). To prove equality, it suffices therefore to show that for all bounded $\varphi$,

$$E\left[E^Y[v(X)]\,\varphi(Y)\right] = E\left[E^Z[v(X)]\,\varphi(Y)\right].$$

Now

$$E\left[E^Y[v(X)]\,\varphi(Y)\right] = E[v(X)\varphi(Y)]$$

for all bounded $\varphi$. On the other hand

$$E\left[E^Z[v(X)]\,\psi(Z)\right] = E[v(X)\psi(Z)]$$

for all bounded $\psi$, and in particular

$$E\left[E^Z[v(X)]\,\varphi(Y)\right] = E[v(X)\varphi(Y)]$$

for all bounded $\varphi$ ($h$ is bijective).


SOLUTION (Exercise 5.7.19). 
Similarly to the unconditioned case 

$$\mathrm{Var}(X \mid Y) = E[X^2 \mid Y] - E[X \mid Y]^2,$$

and therefore

$$E[\mathrm{Var}(X \mid Y)] = E[X^2] - E\left[E[X \mid Y]^2\right].$$

On the other hand

$$\mathrm{Var}(E[X \mid Y]) = E\left[E[X \mid Y]^2\right] - E\left[E[X \mid Y]\right]^2 = E\left[E[X \mid Y]^2\right] - E[X]^2.$$

Summing the last two equalities and using $\mathrm{Var}(X) = E[X^2] - E[X]^2$, we obtain the announced equality.
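The identity obtained by summing, $\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])$, can be checked exactly on a small discrete example. The joint distribution below is hypothetical and chosen only for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y), for illustration only.
pmf = {
    (0, 0): Fraction(1, 8),
    (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4),
    (2, 1): Fraction(1, 4),
}


def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())


def cond_moment(y, power):
    """E[X**power | Y = y]."""
    p_y = sum(p for (_, yy), p in pmf.items() if yy == y)
    return sum(p * x**power for (x, yy), p in pmf.items() if yy == y) / p_y


var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2
mean_cond_var = expect(lambda x, y: cond_moment(y, 2) - cond_moment(y, 1) ** 2)
var_cond_mean = (expect(lambda x, y: cond_moment(y, 1) ** 2)
                 - expect(lambda x, y: cond_moment(y, 1)) ** 2)

print(var_x, mean_cond_var, var_cond_mean)
```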


SOLUTION (Exercise 5.7.20). 
Just imitate the proof of Theorem 2.1.25. 




Chapter 6 


Convergence Almost Sure 


Order hidden in chaos: an erratic sequence of coin tosses exhibits a remarkable 
balance between heads and tails in the long run, at least “when the coin is fair and 
fairly tossed”. This phenomenon is captured by the strong law of large numbers. 
The relevant mathematical notion, which is the object of this chapter, is that of 
almost-sure convergence of a sequence of random variables. 


6.1 A Sufficient Condition and a Criterion 


Consider a game of heads or tails with independent tosses of a single, possibly biased, coin. In other words, we have an IID sequence $\{X_n\}_{n \ge 1}$ of random variables taking two values, 1 (heads) and 0 (tails), with

$$P(X_n = 1) = p \in (0,1).$$

Let

$$S_n := X_1 + X_2 + \cdots + X_n.$$

The random variable $S_n/n$ is the empirical frequency of heads after $n$ tosses. We are interested in the limit of this quantity as $n \uparrow \infty$. As we know “from experience”, the empirical frequency tends to $p$. In fact, this is a theorem: Borel’s strong law of large numbers, which asserts that

$$P\left(\exists \lim_{n \uparrow \infty} \frac{S_n}{n} = p\right) = 1. \tag{6.1}$$

More explicitly: the probability that there exists a limit of the sequence $\{S_n/n\}_{n \ge 1}$ and that this limit is $p$ is equal to 1. We shall in general avoid in similar statements the use of the symbol $\exists$ and write (6.1) in the form $P\left(\lim_{n \uparrow \infty} \frac{S_n}{n} = p\right) = 1$, or $\lim_{n \uparrow \infty} \frac{S_n}{n} = p$, P-a.s.
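A quick simulation gives a feeling for the statement. This is only an illustrative sketch: the bias 0.3, the seed and the sample sizes are arbitrary choices, and a finite simulation illustrates the law but proves nothing.

```python
import random


def empirical_frequency(p: float, n: int, seed: int = 0) -> float:
    """Empirical frequency S_n / n of heads in n independent tosses with P(heads) = p."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n


for n in (10, 1_000, 100_000):
    print(n, empirical_frequency(0.3, n))
```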


Definition 6.1.1 A sequence $\{Z_n\}_{n \ge 1}$ of random variables with values in $\mathbb{C}$ (resp. in $\overline{\mathbb{R}}$) is said to converge P-almost surely (P-a.s.) to the random variable $Z$ with values in $\mathbb{C}$ (resp. in $\overline{\mathbb{R}}$) if

$$P\left(\lim_{n \uparrow \infty} Z_n = Z\right) = 1. \tag{6.2}$$

This is also denoted by

$$Z_n \xrightarrow{\text{a.s.}} Z.$$

Paraphrasing: For all $\omega$ outside a negligible set, $\lim_{n \uparrow \infty} Z_n(\omega) = Z(\omega)$.


In the case where the sequence takes values in $\overline{\mathbb{R}}$, the limit may be infinite. Otherwise, when $P(Z < \infty) = 1$, one may add the precision: “converges to a finite limit”.


The Borel—Cantelli Lemma 


This is one of the fundamental tools in the study of almost sure convergence. 
Consider a sequence of events $\{A_n\}_{n \ge 1}$. We are interested in the probability that $A_n$ occurs infinitely often, that is, the probability of the event

$$\{A_n \text{ i.o.}\} := \{\omega;\ \omega \in A_n \text{ for an infinity of indices } n\},$$

where i.o. abbreviates infinitely often. We have (Borel-Cantelli lemma):


Theorem 6.1.2 For any sequence of events $\{A_n\}_{n \ge 1}$,

$$\sum_{n=1}^\infty P(A_n) < \infty \implies P(A_n \text{ i.o.}) = 0.$$


Proof. We first observe that

$$\{A_n \text{ i.o.}\} = \bigcap_{n=1}^\infty \bigcup_{k \ge n} A_k.$$

(Indeed, if $\omega$ belongs to the set on the right-hand side, then for all $n \ge 1$, $\omega$ belongs to at least one among $A_n, A_{n+1}, \dots$, which implies that $\omega$ is in $A_n$ for an infinite number of indices $n$. Conversely, if $\omega$ is in $A_n$ for an infinite number of indices $n$, it is for all $n \ge 1$ in at least one of the sets $A_n, A_{n+1}, \dots$.)

The set $\bigcup_{k \ge n} A_k$ decreases as $n$ increases, so that by the sequential continuity property of probability,

$$P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\Bigl(\bigcup_{k \ge n} A_k\Bigr). \tag{6.3}$$

But by the sub-σ-additivity property of probability,

$$P\Bigl(\bigcup_{k \ge n} A_k\Bigr) \le \sum_{k \ge n} P(A_k),$$

and by the summability assumption, the right-hand side of this inequality vanishes as $n \uparrow \infty$.


The next result is usually called the converse Borel-Cantelli lemma. It is in fact 
a “pseudo-converse” since an additional assumption of independence is required. 


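Theorem 6.1.3 Let $\{A_n\}_{n \ge 1}$ be a sequence of independent events. Then,

$$\sum_{n=1}^\infty P(A_n) = \infty \implies P(A_n \text{ i.o.}) = 1.$$


Proof. We may without loss of generality assume that $P(A_n) > 0$ for all $n \ge 1$. The divergence hypothesis implies, by the fundamental theorem of convergence of infinite products,¹ that for all $n \ge 1$,

$$\prod_{k=n}^\infty (1 - P(A_k)) = 0.$$

This infinite product equals, in view of the independence assumption,

$$\prod_{k=n}^\infty P\bigl(\overline{A_k}\bigr) = P\Bigl(\bigcap_{k=n}^\infty \overline{A_k}\Bigr) = 1 - P\Bigl(\bigcup_{k=n}^\infty A_k\Bigr).$$

Therefore,

$$P\Bigl(\bigcup_{k=n}^\infty A_k\Bigr) = 1$$

and by (6.3),

$$P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\Bigl(\bigcup_{k \ge n} A_k\Bigr) = 1.$$


¹ Let $\{a_n\}_{n \ge 1}$ be a sequence of numbers in the interval $[0,1)$. Then: (a) if $\sum_{n=1}^\infty a_n < \infty$, then $\lim_{n \uparrow \infty} \prod_{k=1}^n (1 - a_k) > 0$, and (b) if $\sum_{n=1}^\infty a_n = \infty$, then $\lim_{n \uparrow \infty} \prod_{k=1}^n (1 - a_k) = 0$.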

EXAMPLE 6.1.4: BINARY SEQUENCE. Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables with values in $\{0,1\}$, with $P(X_n = 1) = p_n$ ($n \ge 1$).


If $\sum_n p_n < \infty$, then, by the direct Borel-Cantelli lemma, $P(X_n = 1 \text{ i.o.}) = 0$, and therefore $P(\lim_{n \uparrow \infty} X_n = 0) = 1$.


If $\sum_n p_n = \infty$, and if moreover the sequence is independent, then, by the converse Borel-Cantelli lemma, $P(X_n = 1 \text{ i.o.}) = 1$, and therefore the sequence cannot converge to 0. Therefore, in the independent case, a necessary and sufficient condition for almost-sure convergence to 0 is $\sum_n p_n < \infty$.
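The dichotomy can be observed on simulated independent sequences. The sketch below uses two illustrative choices that are not in the text, $p_n = 1/n^2$ (summable) and $p_n = 1/n$ (divergent), and records the indices at which $X_n = 1$. A finite run can only suggest the almost-sure behaviour.

```python
import random


def indices_of_ones(p, n_max: int, seed: int = 0) -> list:
    """Indices n <= n_max with X_n = 1, for independent X_n with P(X_n = 1) = p(n)."""
    rng = random.Random(seed)
    return [n for n in range(1, n_max + 1) if rng.random() < p(n)]


summable = indices_of_ones(lambda n: 1.0 / n**2, 100_000)
divergent = indices_of_ones(lambda n: 1.0 / n, 100_000)

print("sum p_n finite  :", len(summable), "ones, last at index", summable[-1])
print("sum p_n infinite:", len(divergent), "ones, last at index", divergent[-1])
```

In the summable case the ones typically stop early; in the divergent case they keep appearing, although more and more rarely.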


A Sufficient Condition 


The following sufficient condition guaranteeing almost-sure convergence is the 
most useful. It is a direct consequence of the Borel—Cantelli lemma. 


Theorem 6.1.5 Let $\{Z_n\}_{n \ge 1}$ and $Z$ be complex random variables. If

$$\sum_{n \ge 1} P(|Z_n - Z| \ge \varepsilon_n) < \infty \tag{6.4}$$

for some sequence of positive numbers $\{\varepsilon_n\}_{n \ge 1}$ converging to 0, then the sequence $\{Z_n\}_{n \ge 1}$ converges P-a.s. to $Z$.


Proof. Obviously, if $\{\varepsilon_n\}_{n \ge 1}$ is a sequence of positive real numbers converging to 0, then any sequence of non-negative real numbers $\{x_n\}_{n \ge 1}$ such that $x_n \ge \varepsilon_n$ for only a finite number of indices $n \ge 1$ also converges to 0. Therefore it suffices to prove that

$$P(|Z_n - Z| \ge \varepsilon_n \text{ i.o.}) = 0.$$

But this follows from hypothesis (6.4) and the Borel-Cantelli lemma.


A Criterion 


The result below is essentially of theoretical interest. It will be used later on for comparing convergence in probability and almost-sure convergence.

Theorem 6.1.6 The sequence $\{Z_n\}_{n \ge 1}$ of complex random variables converges P-a.s. to the complex random variable $Z$ if and only if for all $\varepsilon > 0$,

$$P(|Z_n - Z| \ge \varepsilon \text{ i.o.}) = 0. \tag{6.5}$$


Proof. For the necessity, observe that

$$\{|Z_n - Z| \ge \varepsilon \text{ i.o.}\} \subseteq \overline{\{\omega;\ \lim_{n \uparrow \infty} Z_n(\omega) = Z(\omega)\}},$$

and therefore

$$P(|Z_n - Z| \ge \varepsilon \text{ i.o.}) \le 1 - P\left(\lim_{n \uparrow \infty} Z_n = Z\right) = 0.$$

For the sufficiency, let $N_k$ be the last index $n$ such that $|Z_n - Z| \ge \frac{1}{k}$ (letting $N_k := \infty$ if $|Z_n - Z| \ge \frac{1}{k}$ for an infinity of indices $n \ge 1$). By (6.5) with $\varepsilon = \frac{1}{k}$, we have $P(N_k = \infty) = 0$. By sub-σ-additivity, $P(\bigcup_{k \ge 1} \{N_k = \infty\}) = 0$. Equivalently, $P(N_k < \infty \text{ for all } k \ge 1) = 1$, which implies $P(\lim_{n \uparrow \infty} Z_n = Z) = 1$.


6.2 The Strong Law of Large Numbers 


In order to prove Borel’s strong law of large numbers using Theorem 6.1.5, we 
must have some adequate upper bound for the general term of the series occurring 
in the left-hand side of (6.4). The basic tool for this is Markov’s inequality. 


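The proof of (6.1) indeed relies on the Borel-Cantelli lemma and the Markov inequality. In fact, we shall apply Theorem 6.1.5, and for this we need to bound the probability that $\left|\frac{S_n}{n} - p\right|$ exceeds some $\varepsilon > 0$ where $p := E[X_1]$, which can be done by application of Markov’s inequality as follows:

$$P\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) = P\left(\left(\frac{S_n}{n} - p\right)^4 \ge \varepsilon^4\right) \le \frac{E\left[\left(\frac{S_n}{n} - p\right)^4\right]}{\varepsilon^4} = \frac{E\left[\left(\sum_{i=1}^n Y_i\right)^4\right]}{n^4 \varepsilon^4},$$

where $Y_i := X_i - p$. In view of the independence hypothesis,

$$E[Y_1 Y_2 Y_3 Y_4] = E[Y_1]E[Y_2]E[Y_3]E[Y_4] = 0,$$

$$E[Y_1 Y_2^3] = E[Y_1]E[Y_2^3] = 0,$$

and the like. Finally, in the development

$$E\left[\left(\sum_{i=1}^n Y_i\right)^4\right] = \sum_{i,j,k,\ell=1}^n E[Y_i Y_j Y_k Y_\ell],$$

only the terms of the form $E[Y_i^4]$ and $E[Y_i^2 Y_j^2]$ ($i \ne j$) remain. There are $n$ terms of the first type and $3n(n-1)$ terms of the second type. Therefore, only $nE[Y_1^4] + 3n(n-1)E[Y_1^2 Y_2^2]$ remains, which is less than $Kn^2$ for some finite $K$. Therefore

$$P\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) \le \frac{K}{n^2 \varepsilon^4},$$

and in particular, with $\varepsilon = n^{-1/8}$,

$$P\left(\left|\frac{S_n}{n} - p\right| \ge n^{-1/8}\right) \le \frac{K}{n^{3/2}}.$$

Therefore, by Theorem 6.1.5, $\left|\frac{S_n}{n} - p\right|$ converges almost surely to 0.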
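The fourth-moment count can be verified exactly for small $n$. The sketch below compares $E[(\sum_i Y_i)^4]$, computed by summing over the binomial distribution of $S_n$, with $nE[Y_1^4] + 3n(n-1)E[Y_1^2]^2$, and checks the bound $Kn^2$ with the particular choice $K = E[Y_1^4] + 3E[Y_1^2]^2$ (one admissible constant, not one named in the text).

```python
from fractions import Fraction
from math import comb


def fourth_moment_exact(n: int, p: Fraction) -> Fraction:
    """E[(S_n - n p)**4] for S_n ~ Binomial(n, p), by direct summation over the pmf."""
    q = 1 - p
    return sum(comb(n, s) * p**s * q ** (n - s) * (s - n * p) ** 4 for s in range(n + 1))


def fourth_moment_formula(n: int, p: Fraction) -> Fraction:
    """n E[Y^4] + 3 n (n - 1) E[Y^2]**2, where Y = X - p and X is Bernoulli(p)."""
    q = 1 - p
    m2 = p * q                   # E[Y^2]
    m4 = p * q**4 + q * p**4     # E[Y^4]
    return n * m4 + 3 * n * (n - 1) * m2**2


p = Fraction(1, 3)
K = (p * (1 - p) ** 4 + (1 - p) * p**4) + 3 * (p * (1 - p)) ** 2
for n in (1, 2, 5, 10):
    print(n, fourth_moment_exact(n, p), fourth_moment_formula(n, p), K * n**2)
```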


EXAMPLE 6.2.1: PATTERNS IN A BERNOULLI SEQUENCE. Let $k$ be a positive integer. Let $\{n_i\}_{1 \le i \le k}$ be a strictly increasing finite sequence of positive integers with $n_1 = 1$. Let $\{\varepsilon_i\}_{1 \le i \le k}$ be a sequence of 0’s and 1’s. The sequence of pairs $\{(n_i, \varepsilon_i)\}_{1 \le i \le k}$ is called a $k$-pattern. Patterns are represented by sequences of 0’s, 1’s and -’s, where - is an “unspecified binary digit”. For instance, the 4-pattern

(1, 0), (3, 1), (4, 1), (6, 0)

is represented by 0-11-0, and this pattern is said to occur at position $n$ in a sequence $x_1, x_2, \dots$ of binary digits if and only if

$$x_n = 0,\quad x_{n+2} = 1,\quad x_{n+3} = 1,\quad x_{n+5} = 0.$$

Let now $\{X_n\}_{n \ge 1}$ be an IID sequence of 0’s and 1’s such that $P(X_1 = 1) = p \in (0,1)$. Define for all $n \ge 1$ the random variable $Y_n$ with values 0 or 1 by

$$Y_n = 1 \text{ iff } X_{n+n_i-1} = \varepsilon_i \text{ for all } i\ (1 \le i \le k),$$

that is, iff the pattern occurs at position $n$. Then (exercise):

$$\frac{Y_1 + \cdots + Y_n}{n} \xrightarrow{\text{a.s.}} p^h q^{k-h}, \quad \text{where } q := 1 - p \text{ and } h := \sum_{i=1}^k \varepsilon_i. \tag{$\star$}$$

In particular, the empirical frequency of any $k$-pattern in a fair game of heads or tails tends to $\frac{1}{2^k}$. Since the Bernoulli sequence with $p = \frac{1}{2}$ (the random sequence “par excellence”) satisfies ($\star$) for all possible patterns, one is tempted to call a deterministic sequence $\{x_n\}_{n \ge 1}$ of 0’s and 1’s “random” if for all patterns

$$\lim_{n \uparrow \infty} \frac{y_1 + \cdots + y_n}{n} = \frac{1}{2^k},$$

where the $y_n$’s are defined in the same way as the $Y_n$’s above. Although this definition seems reasonable, it is not satisfying. In fact, one can show that the (rather deterministic!) Champernowne sequence:

0110111001011101111000 ... ,

which consists of the succession of integers (starting with 0) written in base 2, is random in this sense.
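The pattern frequencies are easy to explore numerically. The sketch below (function names are illustrative) builds a prefix of the Champernowne sequence and counts occurrences of a $k$-pattern given as pairs $(n_i, \varepsilon_i)$ with 1-based positions, following the convention that the pattern occurs at position $n$ when $x_{n+n_i-1} = \varepsilon_i$ for all $i$. No claim is made about how fast the frequency approaches $2^{-k}$; for this sequence the approach is slow.

```python
from itertools import count


def champernowne_bits(length: int) -> list:
    """First `length` binary digits of 0, 1, 10, 11, 100, ... written one after the other."""
    bits = []
    for k in count(0):
        bits.extend(int(b) for b in format(k, "b"))
        if len(bits) >= length:
            return bits[:length]


def pattern_frequency(x: list, pattern: list) -> float:
    """Fraction of admissible positions n at which x[n + n_i - 1] == eps_i for every pair."""
    span = pattern[-1][0]              # largest offset n_k of the pattern
    positions = len(x) - span + 1      # positions where the whole pattern fits
    hits = sum(
        all(x[n - 1 + n_i - 1] == eps for n_i, eps in pattern)  # x is 0-indexed
        for n in range(1, positions + 1)
    )
    return hits / positions


four_pattern = [(1, 0), (3, 1), (4, 1), (6, 0)]   # the pattern 0-11-0 of the example
x = champernowne_bits(200_000)
print(pattern_frequency(x, four_pattern), "to be compared with", 1 / 2**4)
```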


Kolmogorov’s Strong Law of Large Numbers 


Borel’s proof is easily adapted to the case where the X,,’s are uniformly bounded. 
In 1933, Kolmogorov gave the following more general form of the strong law of 
large numbers that requires only that the X,,’s be integrable. 


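Theorem 6.2.2 Let $\{X_n\}_{n \ge 1}$ be an IID sequence of random variables such that

$$E[|X_1|] < \infty. \tag{6.6}$$

Then,

$$P\left(\lim_{n \uparrow \infty} \frac{S_n}{n} = E[X_1]\right) = 1. \tag{6.7}$$


Proof. We may suppose, without loss of generality, that $E[X_1] = 0$. The proof is in two parts. In Part A the strong law is proved with the additional assumption that $\sigma^2 := E[X_1^2] < \infty$, and then Part B gets rid of this assumption.

A. Let

$$Z_m := \sup_{1 \le k \le 2m+1} |X_{m^2+1} + \cdots + X_{m^2+k}|.$$

Defining for all $n \ge 1$ the integer $m(n)$ by

$$m(n)^2 \le n < (m(n) + 1)^2,$$

we have that

$$\left|\frac{S_n}{n}\right| \le \frac{|S_{m(n)^2}|}{m(n)^2} + \frac{Z_{m(n)}}{m(n)^2}.$$

Since $\lim_{n \uparrow \infty} m(n) = +\infty$, it suffices to prove that

$$\lim_{m \uparrow \infty} \frac{|S_{m^2}|}{m^2} = 0 \tag{$\star$}$$

and

$$\lim_{m \uparrow \infty} \frac{Z_m}{m^2} = 0. \tag{$\star\star$}$$

For all $\varepsilon > 0$, by Chebyshev’s inequality,

$$P\left(\frac{|S_{m^2}|}{m^2} \ge \varepsilon\right) \le \frac{\mathrm{Var}(S_{m^2})}{m^4 \varepsilon^2} = \frac{m^2 \sigma^2}{m^4 \varepsilon^2} = \frac{\sigma^2}{m^2 \varepsilon^2}.$$

Therefore $\sum_{m \ge 1} P\left(\frac{|S_{m^2}|}{m^2} \ge \varepsilon\right) < \infty$, which implies ($\star$) (by the Borel-Cantelli lemma and Theorem 6.1.6).

Let now

$$\xi_k := X_{m^2+1} + \cdots + X_{m^2+k}.$$

If $|Z_m| \ge m^2 \varepsilon$, then for at least one $k$ ($1 \le k \le 2m+1$), $|\xi_k| \ge m^2 \varepsilon$. In other words,

$$\left\{\frac{Z_m}{m^2} \ge \varepsilon\right\} \subseteq \bigcup_{k=1}^{2m+1} \{|\xi_k| \ge m^2 \varepsilon\},$$

so that

$$P\left(\frac{Z_m}{m^2} \ge \varepsilon\right) \le P\left(\bigcup_{k=1}^{2m+1} \{|\xi_k| \ge m^2 \varepsilon\}\right) \le \sum_{k=1}^{2m+1} P(|\xi_k| \ge m^2 \varepsilon),$$

and therefore, by Chebyshev’s inequality,

$$P\left(\frac{Z_m}{m^2} \ge \varepsilon\right) \le \sum_{k=1}^{2m+1} \frac{\mathrm{Var}(\xi_k)}{m^4 \varepsilon^2}.$$

Now, when $k \le 2m + 1$,

$$\mathrm{Var}(\xi_k) = \sum_{i=1}^k \mathrm{Var}(X_{m^2+i}) \le (2m + 1)\sigma^2$$

and therefore

$$P\left(\frac{Z_m}{m^2} \ge \varepsilon\right) \le \frac{(2m+1)^2 \sigma^2}{m^4 \varepsilon^2},$$

so that

$$\sum_{m \ge 1} P\left(\frac{Z_m}{m^2} \ge \varepsilon\right) < \infty,$$

and then ($\star\star$) follows from the Borel-Cantelli lemma and the criterion of almost-sure convergence.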


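B. It remains to get rid of the assumption of finiteness of the second moment. The natural technique for this is truncation.

Let

$$\tilde{X}_n := \begin{cases} X_n & \text{if } |X_n| \le n, \\ 0 & \text{otherwise.} \end{cases}$$

We proceed in three steps.
Step 1. We first show that

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^n \bigl(\tilde{X}_k - E[\tilde{X}_k]\bigr) = 0.$$

In view of Part A, it suffices to prove that

$$\sum_{n=1}^\infty \frac{E\bigl[(\tilde{X}_n - E[\tilde{X}_n])^2\bigr]}{n^2} < \infty.$$

But

$$E\bigl[(\tilde{X}_n - E[\tilde{X}_n])^2\bigr] \le E[\tilde{X}_n^2] = E\bigl[X_1^2 1_{\{|X_1| \le n\}}\bigr].$$

It is therefore enough to show that

$$\sum_{n=1}^\infty \frac{E\bigl[X_1^2 1_{\{|X_1| \le n\}}\bigr]}{n^2} < \infty.$$

The left-hand side of the above inequality is equal to

$$\sum_{n=1}^\infty \frac{1}{n^2} \sum_{k=1}^n E\bigl[X_1^2 1_{\{k-1 < |X_1| \le k\}}\bigr] = \sum_{k=1}^\infty \left(\sum_{n=k}^\infty \frac{1}{n^2}\right) E\bigl[X_1^2 1_{\{k-1 < |X_1| \le k\}}\bigr].$$

Using the fact that

$$\sum_{n=k}^\infty \frac{1}{n^2} \le \frac{1}{k^2} + \int_k^\infty \frac{dx}{x^2} = \frac{1}{k^2} + \frac{1}{k} \le \frac{2}{k}$$

(draw the graph of $x \mapsto x^{-2}$), this quantity is less than or equal to

$$\sum_{k=1}^\infty \frac{2}{k}\, E\bigl[X_1^2 1_{\{k-1 < |X_1| \le k\}}\bigr] = 2 \sum_{k=1}^\infty E\left[\frac{X_1^2}{k} 1_{\{k-1 < |X_1| \le k\}}\right] \le 2 \sum_{k=1}^\infty E\bigl[|X_1| 1_{\{k-1 < |X_1| \le k\}}\bigr] = 2E[|X_1|] < \infty.$$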


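Step 2. Since $E[|X_1|] < \infty$, we have by dominated convergence that

$$\lim_{n \uparrow \infty} E\bigl[X_1 1_{\{|X_1| \le n\}}\bigr] = E[X_1] = 0.$$

Since $X_n$ has the same distribution as $X_1$,

$$\lim_{n \uparrow \infty} E[\tilde{X}_n] = \lim_{n \uparrow \infty} E\bigl[X_n 1_{\{|X_n| \le n\}}\bigr] = \lim_{n \uparrow \infty} E\bigl[X_1 1_{\{|X_1| \le n\}}\bigr] = E[X_1] = 0.$$

In particular, by Cesàro’s lemma,²

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^n E[\tilde{X}_k] = 0.$$

Step 3. We have

$$\sum_{n=1}^\infty P(|X_n| > n) = \sum_{n=1}^\infty P(|X_1| > n) \le E[|X_1|] < \infty,$$

and therefore, by the Borel-Cantelli lemma,

$$P(\tilde{X}_n \ne X_n \text{ i.o.}) = P(|X_n| > n \text{ i.o.}) = 0,$$

which implies that

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^n \tilde{X}_k = \lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^n X_k.$$


² Let $\{b_n\}_{n \ge 1}$ be a sequence of real numbers such that $\lim_{n \uparrow \infty} b_n = 0$. Then $\lim_{n \uparrow \infty} \frac{b_1 + \cdots + b_n}{n} = 0$.


The next result shows that the integrability condition is, in a sense, also necessary. See however Exercise 6.6.4.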


Theorem 6.2.3 Let $\{X_n\}_{n \ge 1}$ be a sequence of IID random variables such that

$$\lim_{n \uparrow \infty} \frac{S_n}{n} = C, \quad P\text{-a.s.},$$

where $S_n := X_1 + \cdots + X_n$. Then $E[|X_1|] < \infty$ and $C = E[X_1]$.

Proof. Under these circumstances,

$$\frac{X_n}{n} = \frac{S_n}{n} - \frac{n-1}{n}\,\frac{S_{n-1}}{n-1} \to 0 \quad P\text{-a.s.},$$

and therefore, $P(|X_n| \ge n \text{ i.o.}) = 0$. By the converse Borel-Cantelli lemma,

$$\sum_{n=1}^\infty P(|X_n| \ge n) < \infty$$

or, since the distribution of any $X_n$ does not depend on $n$,

$$\sum_{n=1}^\infty P(|X_1| \ge n) < \infty.$$

But, by the following inequalities concerning any non-negative random variable $X$ (Exercise 5.7.2) and $|X_1|$ in particular,

$$\sum_{n=1}^\infty P(X \ge n) \le E[X] \le 1 + \sum_{n=1}^\infty P(X \ge n),$$

we have that $E[|X_1|] < \infty$. The identification of $C$ and $E[X_1]$ is then just the strong law of large numbers.


Large Deviations from the Strong Law of Large Numbers 


The large deviations theory of random variables produces estimates for the deviation of such variables from their means. When applied to sums of IID variables $S_n = X_1 + \cdots + X_n$, these estimates complement the strong law of large numbers.


The type of result produced by this theory is, in the case where the $X_i$’s are IID and integrable, with common mean $m$,

$$\lim_{n \uparrow \infty} \frac{1}{n} \log P\left(\left|\frac{S_n}{n} - m\right| \ge a\right) = -h(a), \tag{$\star$}$$

where $a > 0$ and $h(a) > 0$. Such bounds have important theoretical implications, but they are somewhat imprecise in that the meaning of ($\star$) is

$$P\left(\left|\frac{S_n}{n} - m\right| \ge a\right) = g(n)\, e^{-n h(a)},$$

where $n^{-1} \log g(n)$ tends to 0 as $n \uparrow \infty$, but perhaps too slowly and in an uncontrolled manner.


To obtain practical (upper) bounds, it is often useful to look at specific cases, 
using the Chernoff bound below at the origin of the general abstract theory. These 
powerful bounds are easy consequences of the elementary Markov inequality. 


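Theorem 6.2.4 Let $X$ be a real-valued random variable and let $a \in \mathbb{R}$. Then (Chernoff’s bound)

$$P(X \ge a) \le \min_{t > 0} \frac{E[e^{tX}]}{e^{ta}} \tag{6.8}$$

and

$$P(X \le a) \le \min_{t < 0} \frac{E[e^{tX}]}{e^{ta}}. \tag{6.9}$$


Proof. By the monotonicity of $x \mapsto e^{tx}$ (increasing for $t > 0$, decreasing for $t < 0$) and Markov’s inequality,

$$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} \quad (t > 0),$$

and

$$P(X \le a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} \quad (t < 0).$$

The result follows by minimizing the right-hand sides with respect to $t > 0$ and $t < 0$, respectively.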


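EXAMPLE 6.2.5: LARGE DEVIATIONS FOR THE POISSON VARIABLE. Let $X$ be a Poisson variable with mean $\theta$ and therefore $E[e^{tX}] = e^{\theta(e^t - 1)}$. We prove that for $c > 0$

$$P(X \ge \theta + c) \le e^{-\theta}\left(\frac{e\theta}{\theta + c}\right)^{\theta + c}.$$

With $a = \theta + c$ in (6.8):

$$P(X \ge \theta + c) \le \min_{t > 0} \frac{e^{\theta(e^t - 1)}}{e^{t(\theta + c)}}.$$

The derivative of the function $f : t \mapsto t(\theta + c) - \theta(e^t - 1)$ at $t > 0$ is $\theta + c - \theta e^t$, and it is null for $e^t = \frac{\theta + c}{\theta}$ or equivalently $t = \ln(\theta + c) - \ln(\theta)$, and this corresponds to a maximum since the second derivative $-\theta e^t$ is negative. Therefore

$$\max_{t > 0} f(t) = (\theta + c) \ln\frac{\theta + c}{\theta} - c,$$

and finally

$$P(X \ge \theta + c) \le \exp\left\{-(\theta + c)\ln\frac{\theta + c}{\theta} + c\right\} = e^{-\theta}\left(\frac{e\theta}{\theta + c}\right)^{\theta + c}.$$
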
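To see how tight the Chernoff bound is for a Poisson variable, one can compare it with the exact tail. The sketch below evaluates the right-hand side of (6.8) at the minimizing $t = \ln((\theta + c)/\theta)$ and computes $P(X \ge k)$ by summing the probability mass function; the values $\theta = 5$, $c = 5$ are arbitrary illustrative choices.

```python
import math


def poisson_tail(theta: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Poisson(theta), as 1 minus the sum of the pmf over 0..k_min-1."""
    pmf = math.exp(-theta)      # P(X = 0)
    cdf = 0.0
    for k in range(k_min):
        cdf += pmf
        pmf *= theta / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - cdf


def chernoff_bound(theta: float, c: float) -> float:
    """exp(theta (e^t - 1) - t (theta + c)) evaluated at e^t = (theta + c) / theta."""
    t = math.log((theta + c) / theta)
    return math.exp(theta * (math.exp(t) - 1.0) - t * (theta + c))


theta, c = 5.0, 5.0
print("exact tail    :", poisson_tail(theta, 10))
print("Chernoff bound:", chernoff_bound(theta, c))
```

The bound is valid but not sharp: it captures the exponential decay rate rather than the exact value of the tail.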
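Theorem 6.2.6 Let $X_1, \dots, X_n$ be IID real-valued random variables and let $a \in \mathbb{R}$. Then,

$$P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n h^+(a)},$$

where

$$h^+(a) = \sup_{t > 0} \{at - \ln E[e^{tX_1}]\}. \tag{6.10}$$


Proof. For all $t > 0$, Markov’s inequality gives

$$P\left(\sum_{i=1}^n X_i \ge na\right) = P\left(e^{t\sum_{i=1}^n X_i} \ge e^{nta}\right) \le E\left[\exp\left\{t\sum_{i=1}^n X_i\right\}\right] \times e^{-nta} = \exp\{-n(at - \ln E[e^{tX_1}])\},$$

from which the result follows by optimizing this bound over $t > 0$.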


Suppose that $E[e^{tX_1}] < \infty$ for all $t \ge 0$. Differentiating $t \mapsto at - \ln E[e^{tX_1}]$ yields $a - \frac{E[X_1 e^{tX_1}]}{E[e^{tX_1}]}$, and therefore the function $t \mapsto at - \ln E[e^{tX_1}]$ is finite and differentiable on $\mathbb{R}_+$, with derivative at 0 equal to $a - E[X_1]$, which implies that when $a > E[X_1]$, $h^+(a)$ is positive.


Similarly to (6.10), we obtain that

$$P\left(\sum_{i=1}^n X_i \le na\right) \le e^{-n h^-(a)},$$

where

$$h^-(a) = \sup_{t < 0} \{at - \ln E[e^{tX_1}]\}.$$

Moreover, if $a < E[X_1]$, $h^-(a)$ is positive.
The Chernoff bound can be interpreted in terms of large deviations from the law of large numbers. Denote by $\mu$ the common mean of the $X_n$’s, and define for $\varepsilon > 0$ the (positive) quantities

$$H^+(\varepsilon) = \sup_{t > 0} \{\varepsilon t - \ln E[e^{t(X_1 - \mu)}]\},$$

$$H^-(\varepsilon) = \sup_{t < 0} \{-\varepsilon t - \ln E[e^{t(X_1 - \mu)}]\}.$$

Then

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \ge \varepsilon\right) \le e^{-nH^+(\varepsilon)} + e^{-nH^-(\varepsilon)}.$$


The computation of the supremum in (6.10) may be tedious. There are shortcuts leading to practical bounds that are not as good but nevertheless satisfactory for certain applications.
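What such a shortcut costs can be seen on steps taking the values $-1$ and $+1$ with probability $\frac{1}{2}$ each, for which $E[e^{tX_1}] = \cosh t$. The sketch below compares the exact supremum in (6.10), attained where $\tanh t = a$ for $0 < a < 1$, with the value $a^2/2$ obtained from the standard inequality $\cosh t \le e^{t^2/2}$ and the choice $t = a$. The function names are illustrative.

```python
import math


def log_mgf(t: float) -> float:
    """ln E[e^{tX}] for X = +1 or -1 with probability 1/2 each, that is ln cosh(t)."""
    return math.log(math.cosh(t))


def h_plus(a: float) -> float:
    """sup over t > 0 of a t - ln cosh(t), for 0 < a < 1; the maximizer solves tanh(t) = a."""
    t_star = math.atanh(a)
    return a * t_star - log_mgf(t_star)


def h_shortcut(a: float) -> float:
    """Value of a t - t**2 / 2 at t = a, using the bound cosh(t) <= exp(t**2 / 2)."""
    return a * a / 2.0


for a in (0.1, 0.5, 0.9):
    print(a, h_plus(a), h_shortcut(a))
```

The shortcut exponent is smaller, hence the resulting bound is weaker, but the two are close for small $a$.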


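EXAMPLE 6.2.7: LARGE DEVIATIONS FOR THE RANDOM WALK. Suppose for instance that $\{X_n\}_{n \ge 1}$ is IID, the $X_n$’s taking the values $-1$ and $+1$ equiprobably so that $E[e^{tX_1}] = \frac{1}{2}e^{t} + \frac{1}{2}e^{-t}$. Replacing $\frac{1}{2}e^{t} + \frac{1}{2}e^{-t}$ by the upper bound $e^{t^2/2}$, we have that, for $a > 0$,

$$P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n\left(at - \frac{1}{2}t^2\right)},$$

and therefore, with $t = a$,

$$P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-\frac{1}{2}na^2}.$$

By symmetry of the distribution of $\sum_{i=1}^n X_i$, we obtain for $a > 0$

$$P\left(\sum_{i=1}^n X_i \le -na\right) = P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-\frac{1}{2}na^2}$$

and therefore, combining the two bounds,

$$P\left(\left|\sum_{i=1}^n X_i\right| \ge na\right) \le 2e^{-\frac{1}{2}na^2}.$$


6.3 Kolmogorov’s Zero-one Law


(The result of this section will be used only in the chapter on martingales.)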


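Definition 6.3.1 Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables and let $\mathcal{F}_n^X := \sigma(X_1, \dots, X_n)$. The σ-field $\mathcal{T}^X := \bigcap_{n \ge 1} \sigma(X_n, X_{n+1}, \dots)$ is called the tail σ-field of this sequence.


EXAMPLE 6.3.2: For any $a \in \mathbb{R}$, the event $\left\{\lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n} \le a\right\}$ belongs to the tail σ-field, since the existence and the value of the limit of $\frac{X_1 + \cdots + X_n}{n}$ does not depend on any fixed finite number of terms of the sequence. More generally, any event concerning $\lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n}$ such as, for instance, the event that such limit exists, is in the tail σ-field.


Recall the notation $\mathcal{F}_\infty^X := \vee_{n \ge 1} \mathcal{F}_n^X$.


Theorem 6.3.3 The tail σ-field of a sequence $\{X_n\}_{n \ge 1}$ of independent random variables is trivial, that is, if $A \in \mathcal{T}^X$, then $P(A) = 0$ or 1.


Proof. The σ-fields $\mathcal{F}_n^X$ and $\sigma(X_{n+k}, X_{n+k+1}, \dots)$ are independent for all $k \ge 1$ and therefore, since $\mathcal{T}^X = \bigcap_{k \ge 1} \sigma(X_{n+k}, X_{n+k+1}, \dots)$, the σ-fields $\mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent. Therefore the algebra $\bigcup_{n \ge 1} \mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent, and consequently (Theorem 5.4.2) $\mathcal{F}_\infty^X$ and $\mathcal{T}^X$ are independent. But $\mathcal{F}_\infty^X \supseteq \mathcal{T}^X$, so that $\mathcal{T}^X$ is independent of itself. In particular, for all $A \in \mathcal{T}^X$, $P(A \cap A) = P(A)P(A)$, that is $P(A) = P(A)^2$, which implies that $P(A) = 0$ or 1.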


6.4 Related Types of Convergence 


Convergence in Probability 


This type of convergence is closely related to almost-sure convergence, yet weaker, as we shall see.

Definition 6.4.1 A sequence $\{Z_n\}_{n \ge 1}$ of complex random variables is said to converge in probability to the complex random variable $Z$ if, for all $\varepsilon > 0$,

$$\lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0. \tag{6.11}$$


Theorem 6.4.2 A. If the sequence $\{Z_n\}_{n \ge 1}$ of complex random variables converges almost surely to some complex random variable $Z$, it also converges in probability to the same random variable $Z$.


B. If the sequence of complex random variables $\{X_n\}_{n \ge 1}$ converges in probability to the complex random variable $X$, one can find a sequence of integers $\{n_k\}_{k \ge 1}$, strictly increasing, such that $\{X_{n_k}\}_{k \ge 1}$ converges almost surely to $X$.


B says, in other words: From a sequence converging in probability to some random variable, one can extract a subsequence converging almost surely to the same random variable.


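Proof. A. Suppose almost-sure convergence. By Theorem 6.1.6, for all $\varepsilon > 0$,

$$P(|Z_n - Z| \ge \varepsilon \text{ i.o.}) = 0,$$

that is

$$P\Bigl(\bigcap_{n \ge 1} \bigcup_{k=n}^\infty (|Z_k - Z| \ge \varepsilon)\Bigr) = 0,$$

or (sequential continuity of probability)

$$\lim_{n \uparrow \infty} P\Bigl(\bigcup_{k=n}^\infty (|Z_k - Z| \ge \varepsilon)\Bigr) = 0,$$

which in turn implies that

$$\lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0.$$

B. By definition of convergence in probability, for all $\varepsilon > 0$,

$$\lim_{n \uparrow \infty} P(|X_n - X| \ge \varepsilon) = 0.$$

Therefore one can find $n_1$ such that

$$P\left(|X_{n_1} - X| \ge \frac{1}{1}\right) \le \left(\frac{1}{1}\right)^2.$$

Then, one can find $n_2 > n_1$ such that

$$P\left(|X_{n_2} - X| \ge \frac{1}{2}\right) \le \left(\frac{1}{2}\right)^2,$$

and so on, until we have a strictly increasing sequence of integers $n_k$ ($k \ge 1$) such that

$$P\left(|X_{n_k} - X| \ge \frac{1}{k}\right) \le \left(\frac{1}{k}\right)^2.$$

It then follows from Theorem 6.1.5 that

$$\lim_{k \uparrow \infty} X_{n_k} = X \quad \text{a.s.}$$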


Exercise 6.6.6 gives an example of a sequence converging in probability, but 
not almost surely. Thus, convergence in probability is in general a notion strictly 
weaker than almost-sure convergence. However, Exercise 6.6.7 gives an important 
example where both convergences occur simultaneously. 
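A standard illustration of the gap between the two notions (not necessarily the one used in Exercise 6.6.6) is the "typewriter" sequence on $\Omega = [0,1)$ with the uniform probability: for $n = 2^k + j$ with $0 \le j < 2^k$, $Z_n$ is the indicator of $[j/2^k, (j+1)/2^k)$. Then $P(Z_n = 1) = 2^{-k} \to 0$, so $Z_n \to 0$ in probability, but every $\omega$ is hit once in each block of indices, so there is no almost-sure convergence; the subsequence $n_k = 2^k$ converges to 0 for every $\omega > 0$. The sketch below checks these facts for one sample point.

```python
def typewriter(n: int, omega: float) -> int:
    """Z_n(omega) for n = 2**k + j, 0 <= j < 2**k: indicator of [j / 2**k, (j + 1) / 2**k)."""
    k = n.bit_length() - 1
    j = n - (1 << k)
    return 1 if j / 2**k <= omega < (j + 1) / 2**k else 0


def prob_one(n: int) -> float:
    """P(Z_n = 1) under the uniform probability on [0, 1): the length of the interval."""
    return 1.0 / 2 ** (n.bit_length() - 1)


omega = 0.3  # an arbitrary sample point
hits_per_block = [sum(typewriter(n, omega) for n in range(2**k, 2 ** (k + 1))) for k in range(10)]
subsequence = [typewriter(2**k, omega) for k in range(10)]

print(hits_per_block)    # number of indices in each block where Z_n(omega) = 1
print(subsequence)       # values along n_k = 2**k
print(prob_one(2**9))
```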


There exists a distance between random variables that metrizes convergence in 
probability, namely 


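$$d(X, Y) := E[|X - Y| \wedge 1].$$


(The verification that $d$ is indeed a metric is left as an exercise.) This means the following:


Theorem 6.4.3 The sequence $\{X_n\}_{n \ge 1}$ converges in probability to the variable $X$ if and only if

$$\lim_{n \uparrow \infty} d(X_n, X) = 0.$$


Proof. If: By Markov’s inequality, for $\varepsilon \in (0, 1]$,

$$P(|X_n - X| \ge \varepsilon) = P(|X_n - X| \wedge 1 \ge \varepsilon) \le \frac{d(X_n, X)}{\varepsilon}.$$

Only if: For all $\varepsilon > 0$,

$$d(X_n, X) = \int_{\{|X_n - X| \ge \varepsilon\}} (|X_n - X| \wedge 1)\,dP + \int_{\{|X_n - X| < \varepsilon\}} (|X_n - X| \wedge 1)\,dP \le P(|X_n - X| \ge \varepsilon) + \varepsilon.$$

If the sequence converges in probability, there exists an $n_0$ such that for $n \ge n_0$, $P(|X_n - X| \ge \varepsilon) \le \varepsilon$ and therefore $d(X_n, X) \le 2\varepsilon$. Since $\varepsilon > 0$ is arbitrary, we have shown that $\lim_{n \uparrow \infty} d(X_n, X) = 0$.
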
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Convergence in the Quadratic Mean 


This type of convergence concerns sequences of square-integrable random variables. 


Definition 6.4.4 A sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables is said to converge in the quadratic mean to the square-integrable complex random variable $Z$ if

$$\lim_{n \uparrow \infty} E[|Z_n - Z|^2] = 0. \tag{6.12}$$


The next result follows from the fact that $L^2_{\mathbb{C}}(P)$, the collection of square-integrable complex-valued random variables, is a Hilbert space when endowed with the inner product

$$\langle X, Y \rangle := E[XY^*].$$

(This is a particular case of Theorem 4.4.19.) In particular,

Theorem 6.4.5 For the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables to converge in the quadratic mean to some square-integrable complex random variable $Z$, it is necessary and sufficient that

$$\lim_{m, n \uparrow \infty} E[|Z_n - Z_m|^2] = 0. \tag{6.13}$$


We now give the property of continuity of the inner product.


Theorem 6.4.6 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be two sequences of square-integrable complex random variables that converge in the quadratic mean to the square-integrable complex random variables $X$ and $Y$ respectively. Then,

$$\lim_{n, m \uparrow \infty} E[X_n Y_m^*] = E[XY^*]. \tag{6.14}$$

Proof. We have

$$|E[X_n Y_m^*] - E[XY^*]| = |E[(X_n - X)(Y_m - Y)^*] + E[(X_n - X)Y^*] + E[X(Y_m - Y)^*]|$$

$$\le |E[(X_n - X)(Y_m - Y)^*]| + |E[(X_n - X)Y^*]| + |E[X(Y_m - Y)^*]|$$

and the right-hand side of this inequality is, by Schwarz’s inequality, less than or equal to

$$\left(E[|X_n - X|^2]\right)^{\frac{1}{2}} \left(E[|Y_m - Y|^2]\right)^{\frac{1}{2}} + \left(E[|X_n - X|^2]\right)^{\frac{1}{2}} \left(E[|Y|^2]\right)^{\frac{1}{2}} + \left(E[|X|^2]\right)^{\frac{1}{2}} \left(E[|Y_m - Y|^2]\right)^{\frac{1}{2}},$$

which tends to 0 as $n, m \uparrow \infty$.


Theorem 6.4.7 If the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables converges in the quadratic mean to the complex random variable $Z$, it also converges in probability to the same random variable.


Proof. It suffices to observe that, by Markov’s inequality, for all $\varepsilon > 0$,

$$P(|Z_n - Z| \ge \varepsilon) \le \frac{1}{\varepsilon^2} E[|Z_n - Z|^2].$$


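EXAMPLE 6.4.8: CONVERGENCE IN QUADRATIC MEAN OF SERIES. Let $\{A_n\}_{n \in \mathbb{Z}}$ and $\{B_n\}_{n \in \mathbb{Z}}$ be two sequences of centered square-integrable complex random variables such that

$$\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty, \quad \sum_{j \in \mathbb{Z}} E[|B_j|^2] < \infty.$$

Suppose, moreover, that

$$E[A_i A_j^*] = E[B_i B_j^*] = E[A_i B_j^*] = 0 \quad (i \ne j).$$

Let

$$U_n := \sum_{j=-n}^n A_j, \quad V_n := \sum_{j=-n}^n B_j.$$

Then $\{U_n\}_{n \ge 1}$ (resp., $\{V_n\}_{n \ge 1}$) converges in the quadratic mean to some square-integrable random variable $U$ (resp., $V$) and

$$E[U] = E[V] = 0 \quad \text{and} \quad E[UV^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*].$$


Proof. We have, for $m > n$,

$$E[|U_m - U_n|^2] = E\left[\Bigl|\sum_{n < |j| \le m} A_j\Bigr|^2\right] = \sum_{n < |j| \le m}\ \sum_{n < |i| \le m} E[A_j A_i^*] = \sum_{n < |j| \le m} E[|A_j|^2],$$

since $E[A_j A_i^*] = 0$ when $i \ne j$. The conclusion then follows from the Cauchy criterion for convergence in the quadratic mean, since

$$\lim_{m, n \uparrow \infty} E[|U_m - U_n|^2] = \lim_{m, n \uparrow \infty} \sum_{n < |j| \le m} E[|A_j|^2] = 0$$

in view of hypothesis $\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty$. By continuity of the inner product in $L^2_{\mathbb{C}}(P)$,

$$E[UV^*] = \lim_{n \uparrow \infty} E[U_n V_n^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n \sum_{\ell=-n}^n E[A_j B_\ell^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n E[A_j B_j^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*].$$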

6.5 Uniform Integrability 


The monotone and dominated convergence theorems are not all the tools that we 
have at our disposition giving conditions under which it is possible to exchange 
limits and expectations. Uniform integrability, which will be introduced now, is 
another such sufficient condition. 


Definition 6.5.1 A collection $\{X_i\}_{i \in I}$ (where $I$ is an arbitrary index set) of integrable random variables is called uniformly integrable if

$$\lim_{c \uparrow \infty} \int_{\{|X_i| > c\}} |X_i|\,dP = 0 \quad \text{uniformly in } i \in I.$$


EXAMPLE 6.5.2: COLLECTION DOMINATED BY AN INTEGRABLE VARIABLE. If, for some integrable random variable $X$, $P(|X_i| \le X) = 1$ for all $i \in I$, then $\{X_i\}_{i \in I}$ is uniformly integrable. Indeed, in this case,

$$\int_{\{|X_i| > c\}} |X_i|\,dP \le \int_{\{X > c\}} X\,dP$$

and by monotone convergence the right-hand side of the above inequality tends to 0 as $c \uparrow \infty$.


Clearly, if one adds a finite number of integrable variables to a uniformly integrable collection, the augmented collection will also be uniformly integrable.

Theorem 6.5.3 The collection $\{X_i\}_{i \in I}$ of integrable random variables is uniformly integrable if and only if

(a) $\sup_i E[|X_i|] < \infty$, and


(b) for every $\varepsilon > 0$, there exists a $\delta(\varepsilon) > 0$ such that

$$\sup_i \int_A |X_i|\,dP \le \varepsilon \quad \text{whenever } P(A) \le \delta(\varepsilon).$$

(In other words, $\int_A |X_i|\,dP \to 0$ uniformly in $i$ as $P(A) \to 0$.)


Proof. Assume uniform integrability. For any $\varepsilon > 0$, there exists a $c$ such that $\int_{\{|X_i| > c\}} |X_i|\,dP \le \frac{\varepsilon}{2}$ for all $i \in I$. For all $A \in \mathcal{F}$, all $i \in I$,

$$\int_A |X_i|\,dP \le cP(A) + \int_{\{|X_i| > c\}} |X_i|\,dP \le cP(A) + \frac{\varepsilon}{2}.$$

Therefore we have (b) by taking $\delta(\varepsilon) = \frac{\varepsilon}{2c}$ and (a) with $A = \Omega$.

Conversely, let $M := \sup_i E[|X_i|] < \infty$. Let $\varepsilon$ and $\delta(\varepsilon)$ be as in (b). Let $c_0 = \frac{M}{\delta(\varepsilon)}$. For all $c \ge c_0$ and all $i \in I$, $P(|X_i| > c) \le \delta(\varepsilon)$ (Markov’s inequality). Apply (b) with $A = \{|X_i| > c\}$ to obtain that $\sup_i \int_{\{|X_i| > c\}} |X_i|\,dP \le \varepsilon$.


Since the “collection” consisting of a single integrable variable $X$ is uniformly
integrable, condition (b) of the theorem above reads

$$\sup_{A;\, P(A)\leq\delta} E[|X| 1_A] \to 0 \quad \text{as } \delta \to 0. \qquad (6.15)$$

This simple observation will be used in the proof of the next result.


Theorem 6.5.4 Let $Y$ be an integrable random variable and let $\{\mathcal{F}_i\}_{i\in I}$ be a
collection of sub-$\sigma$-fields of $\mathcal{F}$. The collection $X_i := E[Y \mid \mathcal{F}_i]$ ($i \in I$) is uniformly
integrable.

Proof. By Jensen's inequality,

$$|X_i| = |E[Y \mid \mathcal{F}_i]| \leq E[|Y| \mid \mathcal{F}_i]$$

and therefore, for all $a > 0$,

$$E\left[|X_i| 1_{\{|X_i|>a\}}\right] \leq E\left[Z_i 1_{\{Z_i\geq a\}}\right],$$

where $Z_i := E[|Y| \mid \mathcal{F}_i]$. By definition of conditional expectation, since
$\{Z_i \geq a\} \in \mathcal{F}_i$,

$$E\left[(|Y| - Z_i) 1_{\{Z_i\geq a\}}\right] = 0$$

and therefore

$$E\left[|X_i| 1_{\{|X_i|>a\}}\right] \leq E\left[|Y| 1_{\{Z_i\geq a\}}\right]. \qquad (\star)$$

By Markov's inequality,

$$P(Z_i \geq a) \leq \frac{E[Z_i]}{a} = \frac{E[|Y|]}{a}$$

and therefore $P(Z_i \geq a) \to 0$ as $a \to \infty$ uniformly in $i$. Use (6.15) to obtain that
$E\left[|Y| 1_{\{Z_i\geq a\}}\right] \to 0$ as $a \to \infty$ uniformly in $i$. Conclude with $(\star)$.


Theorem 6.5.5 A sufficient condition for the collection $\{X_i\}_{i\in I}$ of integrable
random variables to be uniformly integrable is the existence of a non-negative
non-decreasing function $G : \mathbb{R} \to \mathbb{R}$ such that

$$\lim_{t\uparrow\infty} \frac{G(t)}{t} = +\infty$$

and

$$\sup_i E[G(|X_i|)] < \infty.$$

Proof. Fix $\varepsilon > 0$ and let $a = \frac{M}{\varepsilon}$, where $M := \sup_i E[G(|X_i|)]$. Take $c$ large
enough so that $G(t)/t \geq a$ for $t \geq c$. In particular, $|X_i| \leq \frac{G(|X_i|)}{a}$ on $\{|X_i| > c\}$
and therefore

$$\int_{\{|X_i|>c\}} |X_i|\, dP \leq \frac{1}{a} E\left[G(|X_i|) 1_{\{|X_i|>c\}}\right] \leq \frac{M}{a} = \varepsilon$$

uniformly in $i$.


EXAMPLE 6.5.6: TWO SUFFICIENT CONDITIONS FOR UNIFORM INTEGRABILITY.
Two frequently used sufficient conditions guaranteeing uniform integrability
are

$$\sup_i E\left[|X_i|^{1+\alpha}\right] < \infty \quad (\alpha > 0)$$

and

$$\sup_i E\left[|X_i| \log^+ |X_i|\right] < \infty.$$


Almost-sure convergence of a sequence of integrable random variables to an
integrable random variable does not necessarily imply convergence in $L^1$. However:

Theorem 6.5.7 Let $\{X_n\}_{n\geq 1}$ be a sequence of integrable random variables and let
$X$ be some random variable. The following are equivalent:

(a) $\{X_n\}_{n\geq 1}$ is uniformly integrable and $X_n \xrightarrow{\text{Pr.}} X$ as $n \to \infty$.

(b) $X$ is integrable and $X_n \xrightarrow{L^1} X$ as $n \to \infty$.


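Proof. (a) implies (b): Since $X_n \xrightarrow{\text{Pr.}} X$, there exists a subsequence $\{X_{n_k}\}_{k\geq 1}$ such
that $X_{n_k} \xrightarrow{\text{a.s.}} X$. By Fatou's lemma,

$$E[|X|] \leq \liminf_{k} E[|X_{n_k}|] \leq \sup_{n_k} E[|X_{n_k}|] \leq \sup_{n} E[|X_n|] < \infty.$$

Therefore $X \in L^1_{\mathbb{R}}(P)$. Also, for fixed $\varepsilon > 0$,

$$E[|X_n - X|] \leq \int_{\{|X_n-X|<\varepsilon\}} |X_n - X|\, dP + \int_{\{|X_n-X|\geq\varepsilon\}} |X_n|\, dP + \int_{\{|X_n-X|\geq\varepsilon\}} |X|\, dP$$

$$\leq \varepsilon + \int_{\{|X_n-X|\geq\varepsilon\}} |X_n|\, dP + \int_{\{|X_n-X|\geq\varepsilon\}} |X|\, dP.$$

Recall that adding an integrable random variable to a uniformly integrable
collection preserves uniform integrability. Apply (b) of Theorem 6.5.3 to the uniformly
integrable family $\{X_n\}_{n\geq 0}$ where $X_0 := X$, denoting by $\delta'$ the corresponding $\delta$.
By hypothesis, $P(|X_n - X| \geq \varepsilon) \leq \delta'$ for large enough $n$. By (b) of Theorem
6.5.3 with $A := \{|X_n - X| \geq \varepsilon\}$, for large enough $n$, $\int_{\{|X_n-X|\geq\varepsilon\}} |X_n|\, dP \leq \varepsilon$ and
$\int_{\{|X_n-X|\geq\varepsilon\}} |X|\, dP \leq \varepsilon$. Therefore, $E[|X_n - X|] \leq 3\varepsilon$ for large enough $n$, thus
proving convergence in $L^1$.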

(b) implies (a): Let $\varepsilon > 0$ be given and let $n_0$ be such that $E[|X_n - X|] \leq \varepsilon$
for all $n \geq n_0$. The random variables $X, X_1, \ldots, X_{n_0}$ being integrable, there exists
a $\delta > 0$ such that if $P(A) \leq \delta$, then $\int_A |X|\, dP \leq \varepsilon$ and $\int_A |X_n|\, dP \leq \varepsilon$ for $n \leq n_0$. If
$n \geq n_0$, by the triangle inequality,

$$\int_A |X_n|\, dP \leq \int_A |X|\, dP + \int_A |X_n - X|\, dP \leq 2\varepsilon,$$

and therefore (b) of Theorem 6.5.3 is satisfied. Condition (a) of Theorem 6.5.3 is
also satisfied, since $E[|X_n|] \leq E[|X_n - X|] + E[|X|]$.
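The role of uniform integrability in Theorem 6.5.7 can be illustrated on a small example. The sketch below is not taken from the text: it assumes the probability space $([0,1], \text{Lebesgue})$ and the hypothetical family $X_n = n\,1_{(0,1/n)}$, which converges almost surely to $0$ while $E[|X_n|] = 1$ for all $n$, and compares it with the dominated family $1_{(0,1/n)}$ of Example 6.5.2. The function names are illustrative, and the tail expectations are computed in closed form with exact rational arithmetic.

```python
from fractions import Fraction


def tail_expectation(height, prob, c):
    """E[|X| 1_{|X| > c}] for X equal to `height` with probability `prob`, 0 otherwise."""
    return height * prob if height > c else Fraction(0)


def sup_tail_spike(c, n_max):
    """sup over n <= n_max of the tail expectation of X_n = n 1_{(0,1/n)}."""
    return max(tail_expectation(Fraction(n), Fraction(1, n), c)
               for n in range(1, n_max + 1))


def sup_tail_bounded(c, n_max):
    """Same supremum for the family X_n = 1_{(0,1/n)}, which is dominated by 1."""
    return max(tail_expectation(Fraction(1), Fraction(1, n), c)
               for n in range(1, n_max + 1))


# The supremum for the spike family stays at 1 however large c is (as soon as
# some n > c is included), so that family is not uniformly integrable.
for c in (10, 100, 1000):
    print(c, sup_tail_spike(c, 10 * c), sup_tail_bounded(c, 10 * c))
```

Both families converge almost surely to $0$, but only the dominated one converges in $L^1$, in agreement with the theorem.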


6.6 Exercises 


Exercise 6.6.1. IN PROBABILITY BUT NOT ALMOST SURELY
Let $\{X_n\}_{n\geq 2}$ be an independent sequence of random variables such that

$$P(X_n = 0) = 1 - \frac{1}{n \ln n}, \qquad P(X_n = n) = P(X_n = -n) = \frac{1}{2n \ln n}.$$

Let $S_n := \sum_{i=2}^{n} X_i$. Prove that $\frac{S_n}{n} \to 0$ in probability but not almost surely.


Exercise 6.6.2. A RECURRENCE EQUATION, TAKE 2
Recall the notation $a^+ = \max(a, 0)$. Consider the recurrence equation

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1} \quad (n \geq 0),$$

where $X_0$ and $Z_n$ ($n \geq 1$) are integer-valued random variables, and $\{Z_n\}_{n\geq 1}$ is IID
and independent of $X_0$.

(a) Show that $\lim_{n\uparrow\infty} X_n = +\infty$ if $E[Z_1] > 1$.

(b) Let $T_0$ be the first time $n \geq 1$ for which $X_n = 0$. Show that if $E[Z_1] < 1$, then


Exercise 6.6.3. ASYMPTOTICS OF THE RENEWAL PROCESS
Let $\{S_n\}_{n\geq 1}$ be an IID sequence of real random variables such that

$$P(0 < S_1 < +\infty) = 1 \quad \text{and} \quad E[S_1] < \infty,$$

and let for each $t \geq 0$, $N(t) = \sum_{n\geq 1} 1_{(0,t]}(T_n)$, where $T_n = S_1 + \cdots + S_n$. (The
sequence $\{T_n\}_{n\geq 1}$ is called a renewal process.)

(a) Prove that $P$-almost surely $\lim_{n\uparrow\infty} T_n = \infty$ and $\lim_{t\uparrow\infty} N(t) = \infty$.

(b) Prove that $P$-almost surely $\lim_{t\uparrow\infty} \frac{N(t)}{t} = \frac{1}{E[S_1]}$.


Exercise 6.6.4. SLLN FOR NON-NEGATIVE SEQUENCES
Let $\{X_n\}_{n\geq 1}$ be an IID sequence of non-negative random variables such that
$E[X_1] = \infty$. Show that

$$\lim_{n\uparrow\infty} \frac{X_1 + \cdots + X_n}{n} = \infty \quad (= E[X_1]).$$


Exercise 6.6.5. A RESULT FROM ANALYSIS
Let $f : [0,1] \to \mathbb{R}$ be a continuous function. Prove that

$$\lim_{n\uparrow\infty} \int_0^1 \cdots \int_0^1 f\left(\frac{x_1 + \cdots + x_n}{n}\right) dx_1 \cdots dx_n$$

exists, and find it. (A probabilistic proof is required.)


Exercise 6.6.6. CONVERGENCE IN PROBABILITY BUT NOT ALMOST-SURE, II
Let $\{X_n\}_{n\geq 1}$ be a sequence of independent random variables taking values in $\{0, 1\}$.

(A) Show that a necessary and sufficient condition for this sequence to converge
almost surely to 0 is $\sum_{n\geq 1} P(X_n = 1) < \infty$.

(B) Show that a necessary and sufficient condition for this sequence to converge
in probability to 0 is $\lim_{n\uparrow\infty} P(X_n = 1) = 0$.

(C) Deduce from the above that convergence in probability does not imply in
general almost-sure convergence.


Exercise 6.6.7. WHEN CONVERGENCE IN PROBABILITY IMPLIES ALMOST-SURE
CONVERGENCE

Let $\{X_n\}_{n\geq 1}$ be a sequence of non-negative random variables. Let $S_n := X_1 + \cdots + X_n$.
Show that the convergence in probability of the sequence $\{S_n\}_{n\geq 1}$ implies its
almost-sure convergence.


Exercise 6.6.8. IN PROBABILITY AND IN THE QUADRATIC MEAN
Let $\alpha > 0$, and let $\{Z_n\}_{n\geq 1}$ be a sequence of random variables such that

$$P(Z_n = 1) = 1 - \frac{1}{n^\alpha}, \qquad P(Z_n = n) = \frac{1}{n^\alpha}.$$

Show that $\{Z_n\}_{n\geq 1}$ converges in probability to some variable $Z$ to be identified.
For what values of $\alpha$ does $\{Z_n\}_{n\geq 1}$ converge to $Z$ in the quadratic mean?


Exercise 6.6.9. CONTINUITY OF THE MEAN AND VARIANCE.
Prove the following: If the sequence $\{Z_n\}_{n\geq 1}$ of square-integrable complex random
variables converges in the quadratic mean to the complex random variable $Z$, then

$$\lim_{n\uparrow\infty} E[Z_n] = E[Z] \quad \text{and} \quad \lim_{n\uparrow\infty} E\left[|Z_n|^2\right] = E\left[|Z|^2\right].$$


Exercise 6.6.10. $g(Z_n)$

Suppose the sequence of random variables $\{Z_n\}_{n\geq 1}$ converges to $a$ in probability.
Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function. Show that $\{g(Z_n)\}_{n\geq 1}$ converges to $g(a)$
in probability.




Chapter 7 


Convergence in Distribution 


The next fundamental notion of convergence after almost-sure convergence is con- 
vergence in distribution, and the main result there is the central limit theorem, the 
heart of statistics, which is the art of assessing probability models (is this coin 
fair?). Although these notions are linked in various ways, they are fundamentally 
different. 


7.1 Paul Lévy’s Criterion 


Let $\{X_n\}_{n\geq 1}$ and $X$ be real random variables with respective cumulative
distribution functions $\{F_n\}_{n\geq 1}$ and $F$. The “natural” definition of convergence in
distribution of $\{X_n\}_{n\geq 1}$ to $X$ could be the following:

$$\lim_{n\uparrow\infty} F_n(x) = F(x) \quad (x \in \mathbb{R}). \qquad (\star)$$

In this provisional definition, there is no restriction on the $x$'s in $\mathbb{R}$ for which
$(\star)$ is required. However, if it were adopted, one could not say that the “random”
(actually deterministic) sequence of random variables $X_n = a + \frac{1}{n}$, where $a \in \mathbb{R}$,
converges in distribution to $X = a$. The following definition takes care of this
anomaly.
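The anomaly can be checked directly. The following sketch (an illustration only; the value of `a` and the helper names are assumptions of the example) evaluates the cumulative distribution functions of the deterministic variables $a + 1/n$ and $a$: they agree in the limit at every point except $x = a$, the unique discontinuity point of the limit CDF.

```python
def cdf_constant(value):
    """CDF of the deterministic random variable equal to `value`."""
    return lambda x: 1.0 if x >= value else 0.0


a = 2.0              # hypothetical value of the constant a
F = cdf_constant(a)  # CDF of the limit X = a


def F_n(n):
    """CDF of X_n = a + 1/n."""
    return cdf_constant(a + 1.0 / n)


def limit_gap(x, n=10**6):
    """|F_n(x) - F(x)| for a large n, used here as a proxy for the limit."""
    return abs(F_n(n)(x) - F(x))


# F_n(a) = 0 for every n while F(a) = 1: the provisional definition fails at x = a.
print(limit_gap(a), limit_gap(a - 0.5), limit_gap(a + 0.5))
```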


Definition 7.1.1 Let $\{X_n\}_{n\geq 1}$ and $X$ be real random variables with respective
cumulative distribution functions $\{F_n\}_{n\geq 1}$ and $F$. The sequence $\{X_n\}_{n\geq 1}$ is said
to converge in distribution to $X$ if

$$\lim_{n\uparrow\infty} F_n(x) = F(x) \quad \text{for all continuity points } x \text{ of } F, \qquad (7.1)$$

where the point $x \in \mathbb{R}$ is called a continuity point of the cumulative distribution
function $F$ on $\mathbb{R}$ if $F(x) = F(x-)$.

This is denoted by: $X_n \xrightarrow{\mathcal{D}} X$.


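EXAMPLE 7.1.2: MAGNIFIED MINIMUM. Let $\{Y_n\}_{n\geq 1}$ be a sequence of IID
random variables uniformly distributed on $[0,1]$. Then

$$X_n = n \min(Y_1, \ldots, Y_n) \xrightarrow{\mathcal{D}} \mathcal{E}(1)$$

(the exponential distribution with mean 1). In fact, for all $x \in [0,n]$,

$$P(X_n > x) = P\left(\min(Y_1, \ldots, Y_n) > \frac{x}{n}\right) = \prod_{i=1}^{n} P\left(Y_i > \frac{x}{n}\right) = \left(1 - \frac{x}{n}\right)^n$$

and therefore $\lim_{n\uparrow\infty} P(X_n > x) = e^{-x} 1_{\mathbb{R}_+}(x)$.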
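The convergence in this example can be observed numerically from the closed-form survival function $(1 - x/n)^n$. The sketch below (illustrative function names, evaluation grid chosen arbitrarily on $[0, 10]$) measures the largest gap with the exponential survival function $e^{-x}$ for increasing $n$.

```python
import math


def survival_magnified_min(x, n):
    """P(n * min(Y_1, ..., Y_n) > x) for IID uniform variables on [0, 1]."""
    if x < 0:
        return 1.0
    if x >= n:
        return 0.0
    return (1.0 - x / n) ** n


def survival_exp(x):
    """P(E > x) for an exponential variable E with mean 1."""
    return math.exp(-x) if x >= 0 else 1.0


def max_gap(n, grid_size=100, x_max=10.0):
    """Largest gap between the two survival functions on a grid of [0, x_max]."""
    xs = [x_max * k / grid_size for k in range(grid_size + 1)]
    return max(abs(survival_magnified_min(x, n) - survival_exp(x)) for x in xs)


for n in (10, 100, 1000, 10**6):
    print(n, max_gap(n))
```

The gap shrinks roughly like $1/n$, consistent with the expansion $(1 - x/n)^n = e^{-x}(1 + O(1/n))$ for fixed $x$.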


For random vectors, another definition (which in the univariate case turns out 
to be equivalent; see Theorem 7.1.5) is needed. 


Definition 7.1.3 Let $\{X_n\}_{n\geq 1}$ and $X$ be random vectors of $\mathbb{R}^d$. The sequence
$\{X_n\}_{n\geq 1}$ is said to converge in distribution to $X$ if for all continuous and bounded
functions $f : \mathbb{R}^d \to \mathbb{R}$,

$$\lim_{n\uparrow\infty} E[f(X_n)] = E[f(X)].$$

The vectors $X$ and $X_n$ ($n \geq 1$) need not be defined on the same probability
space. Convergence in distribution concerns only probability distributions. As
a matter of fact, very often, the $X_n$'s are defined on the same probability space
but there is no “visible” (that is, defined on the same probability space) limit
random vector $X$. Therefore one sometimes denotes convergence in distribution
as follows: $X_n \xrightarrow{\mathcal{D}} Q$, where $Q$ is a probability distribution on $\mathbb{R}^d$. If $Q$ is a “famous”
probability distribution, for instance the standard Gaussian distribution, we then say
that “$\{X_n\}_{n\geq 1}$ converges in distribution to a standard Gaussian distribution”, and
denote this by: $X_n \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1)$.

Theorem 7.1.4 Let $\{X_n\}_{n\geq 1}$ be a sequence of random vectors of $\mathbb{R}^d$ with respective
characteristic functions $\{\varphi_n\}_{n\geq 1}$.

A. Suppose that there exists a function $\varphi$ such that

$$\lim_{n\uparrow\infty} \varphi_n = \varphi. \qquad (7.2)$$

If $\varphi$ is continuous at $0$, this function is the characteristic function of a random vector
$X$ and $\{X_n\}_{n\geq 1}$ converges in distribution to $X$.

B. In fact, a necessary and sufficient condition for $\{X_n\}_{n\geq 1}$ to converge in
distribution to some random vector $X$ with characteristic function $\varphi$ is that
(7.2) holds true.

This result is the Paul Lévy criterion for convergence in distribution. Its (very
technical) proof will be omitted as well as the proof of the next result.¹

Theorem 7.1.5 In the univariate case ($d = 1$), the conditions (7.2) and (7.1) are
equivalent.


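The following result, Slutsky's lemma, is often used.

Theorem 7.1.6 Let $\{X_n\}_{n\geq 1}$ and $\{Y_n\}_{n\geq 1}$ be sequences of real random variables
such that $Y_n \xrightarrow{\text{Pr.}} 0$ and $X_n \xrightarrow{\mathcal{D}} X$ for some real random variable $X$. Then
$X_n + Y_n \xrightarrow{\mathcal{D}} X$.

Proof. By Lévy's criterion, we must show that $\lim_{n\uparrow\infty} \varphi_{X_n+Y_n}(u) = \varphi_X(u)$ for all
$u \in \mathbb{R}$. Since

$$|\varphi_{X_n+Y_n}(u) - \varphi_X(u)| \leq |\varphi_{X_n+Y_n}(u) - \varphi_{X_n}(u)| + |\varphi_{X_n}(u) - \varphi_X(u)|,$$

and since by hypothesis ($X_n \xrightarrow{\mathcal{D}} X$) the second term of the right-hand side of
the above inequality tends to 0, it remains to show that the first term tends to
0. But the latter equals

$$\left|E\left[e^{iuX_n}\left(e^{iuY_n} - 1\right)\right]\right| \leq E\left[\left|e^{iuY_n} - 1\right|\right].$$

Now, for any $\varepsilon > 0$ there exists a $\delta > 0$ such that $|y| \leq \delta \Rightarrow |e^{iuy} - 1| \leq \varepsilon$.
Therefore

$$E\left[\left|e^{iuY_n} - 1\right|\right] = E\left[\left|e^{iuY_n} - 1\right| 1_{\{|Y_n|>\delta\}}\right] + E\left[\left|e^{iuY_n} - 1\right| 1_{\{|Y_n|\leq\delta\}}\right] \leq 2P(|Y_n| > \delta) + \varepsilon.$$

Since $Y_n \xrightarrow{\text{Pr.}} 0$, $\lim_{n\uparrow\infty} P(|Y_n| > \delta) = 0$. Therefore

$$\limsup_{n\uparrow\infty} E\left[\left|e^{iuY_n} - 1\right|\right] \leq \varepsilon,$$

and since $\varepsilon$ is arbitrary, $\lim_{n\uparrow\infty} E\left[\left|e^{iuY_n} - 1\right|\right] = 0$.

¹ The classic reference is [2].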


Bochner’s Theorem 


This result is of paramount importance in the theory of wide-sense stationary 
processes (Chapter 12). 


The characteristic function $\varphi$ of a real random variable $X$ has the following
properties:

A. it is hermitian symmetric (that is, $\varphi(-u) = \varphi(u)^*$) and uniformly bounded
(in fact, $|\varphi(u)| \leq \varphi(0)$);

B. it is uniformly continuous on $\mathbb{R}$; and

C. it is definite non-negative, in the sense that for all integers $n$, all $u_1, \ldots, u_n \in \mathbb{R}$,
and all $z_1, \ldots, z_n \in \mathbb{C}$,

$$\sum_{j=1}^{n} \sum_{k=1}^{n} \varphi(u_j - u_k) z_j z_k^* \geq 0$$

(just observe that the left-hand side equals $E\left[\left|\sum_{j=1}^{n} z_j e^{iu_jX}\right|^2\right]$).
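Properties A and C can be verified numerically on a simple case. The sketch below is only an illustration: it assumes a hypothetical random variable $X$ uniformly distributed on three points (the support, the points $u_j$ and the coefficients $z_j$ are arbitrary choices), and compares the quadratic form of property C with the expectation $E\left[\left|\sum_j z_j e^{iu_jX}\right|^2\right]$ mentioned in the parenthetical remark.

```python
import cmath

# Hypothetical discrete distribution: X uniform on three points.
support = [-1.0, 0.5, 2.0]
weights = [1.0 / 3.0] * 3


def phi(u):
    """Characteristic function E[exp(iuX)] of the discrete variable X."""
    return sum(w * cmath.exp(1j * u * x) for x, w in zip(support, weights))


def quadratic_form(us, zs):
    """Double sum of phi(u_j - u_k) z_j conj(z_k) over all pairs (j, k)."""
    return sum(phi(uj - uk) * zj * zk.conjugate()
               for uj, zj in zip(us, zs)
               for uk, zk in zip(us, zs))


def expected_square(us, zs):
    """E[|sum_j z_j exp(i u_j X)|^2], computed directly from the distribution."""
    return sum(w * abs(sum(z * cmath.exp(1j * u * x) for u, z in zip(us, zs))) ** 2
               for x, w in zip(support, weights))


us = [0.3, -1.2, 2.5]
zs = [1 + 2j, -0.5j, 3 + 0j]
q = quadratic_form(us, zs)
print(q.real, q.imag, expected_square(us, zs))
```

The two computations agree up to rounding, the imaginary part vanishes, and the common value is non-negative.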


It turns out that Properties A, B and C characterize characteristic functions
(up to a multiplicative constant). This is Bochner's theorem:

Theorem 7.1.7 Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C.
Then there exists a constant $0 \leq \beta < \infty$ and a real random variable $X$ such that
for all $u \in \mathbb{R}$,

$$\varphi(u) = \beta E\left[e^{iuX}\right].$$


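Proof. We henceforth eliminate the trivial case where $\varphi(0) = 0$ (implying, in
view of condition A, that $\varphi$ is the null function). For any continuous function
$z : \mathbb{R} \to \mathbb{C}$ and any $A \geq 0$,

$$\int_0^A \int_0^A \varphi(u - v) z(u) z^*(v)\, du\, dv \geq 0. \qquad (\star)$$

Indeed, since the integrand is continuous, the integral is the limit as $n \uparrow \infty$ of

$$\frac{A^2}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} \varphi\left(\frac{A(j-k)}{n}\right) z\left(\frac{Aj}{n}\right) z\left(\frac{Ak}{n}\right)^*,$$

a non-negative quantity by condition C. From $(\star)$ with $z(u) := e^{-ixu}$, we have that

$$g(x, A) := \frac{1}{2\pi A} \int_0^A \int_0^A \varphi(u - v) e^{-ix(u-v)}\, du\, dv \geq 0.$$

Changing variables, we obtain the alternative expression

$$g(x, A) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\left(\frac{u}{A}\right) \varphi(u) e^{-iux}\, du,$$

where $h(u) = (1 - |u|) 1_{\{|u|\leq 1\}}$. Let $M > 0$. We have

$$\int_{-\infty}^{+\infty} h\left(\frac{x}{2M}\right) g(x, A)\, dx = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\left(\frac{u}{A}\right) \varphi(u) \left(\int_{-\infty}^{+\infty} h\left(\frac{x}{2M}\right) e^{-iux}\, dx\right) du$$

$$= \frac{1}{\pi} M \int_{-\infty}^{+\infty} h\left(\frac{u}{A}\right) \varphi(u) \left(\frac{\sin Mu}{Mu}\right)^2 du.$$

Therefore

$$\int_{-\infty}^{+\infty} h\left(\frac{x}{2M}\right) g(x, A)\, dx \leq \frac{1}{\pi} M \int_{-\infty}^{+\infty} h\left(\frac{u}{A}\right) |\varphi(u)| \left(\frac{\sin Mu}{Mu}\right)^2 du \leq \frac{1}{\pi} \varphi(0) \int_{-\infty}^{+\infty} \left(\frac{\sin u}{u}\right)^2 du = \varphi(0).$$

By monotone convergence,

$$\lim_{M\uparrow\infty} \int_{-\infty}^{+\infty} h\left(\frac{x}{2M}\right) g(x, A)\, dx = \int_{-\infty}^{+\infty} g(x, A)\, dx,$$

and therefore

$$\int_{-\infty}^{+\infty} g(x, A)\, dx \leq \varphi(0).$$

The function $x \mapsto g(x, A)$ is therefore integrable and it is the Fourier transform of
the integrable and continuous function $u \mapsto h\left(\frac{u}{A}\right) \varphi(u)$. Therefore, by the Fourier
inversion formula:

$$h\left(\frac{u}{A}\right) \varphi(u) = \int_{-\infty}^{+\infty} g(x, A) e^{iux}\, dx.$$

In particular, with $u = 0$, $\int_{-\infty}^{+\infty} g(x, A)\, dx = \varphi(0)$. Therefore, $f(x, A) := \frac{g(x, A)}{\varphi(0)}$ is
the probability density of some real random variable with characteristic function
$h\left(\frac{u}{A}\right) \frac{\varphi(u)}{\varphi(0)}$. But

$$\lim_{A\uparrow\infty} h\left(\frac{u}{A}\right) \frac{\varphi(u)}{\varphi(0)} = \frac{\varphi(u)}{\varphi(0)}.$$

This limit of a sequence of characteristic functions is continuous at 0 and is
therefore a characteristic function (Paul Lévy's criterion, Theorem 7.1.4).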

7.2 The Central Limit Theorem 


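This is the emblematic theorem of Statistics.

Theorem 7.2.1 Let $\{X_n\}_{n\geq 1}$ be an IID sequence of real random variables such
that

$$E\left[|X_1|^2\right] < \infty. \qquad (7.3)$$

(In particular, $E[|X_1|] < \infty$.) Then, for all $x \in \mathbb{R}$,

$$\lim_{n\uparrow\infty} P\left(\frac{S_n - nE[X_1]}{\sigma\sqrt{n}} \leq x\right) = P(\mathcal{N}(0,1) \leq x), \qquad (7.4)$$

where $S_n := X_1 + \cdots + X_n$ and $\sigma^2$ is the variance of $X_1$ (assumed positive).

The random variable on the left-hand side of (7.4) is obtained by centering the sum $S_n$
(subtracting its mean $nE[X_1]$) and then normalizing it (dividing by the square
root of its variance so that the resulting variance equals 1).

Proof. Assume without loss of generality that $E[X_1] = 0$. Let $\sigma^2$ be the variance
of $X_1$. By the characteristic function criterion for convergence in distribution, it
suffices to show that

$$\lim_{n\uparrow\infty} \varphi_n(u) = e^{-\sigma^2 u^2/2},$$

where

$$\varphi_n(u) = \psi\left(\frac{u}{\sqrt{n}}\right)^n$$

is the characteristic function of $\frac{S_n}{\sqrt{n}}$ and where $\psi$ is the characteristic function of $X_1$. From the Taylor expansion of $\psi$ about
zero,

$$\psi(u) = 1 + \frac{\psi''(0)}{2} u^2 + o(u^2),$$

we have, for fixed $u \in \mathbb{R}$ (recall that $\psi''(0) = -\sigma^2$),

$$\psi\left(\frac{u}{\sqrt{n}}\right) = 1 - \frac{\sigma^2 u^2}{2n} + o\left(\frac{1}{n}\right),$$

and therefore

$$\lim_{n\uparrow\infty} \ln\{\varphi_n(u)\} = \lim_{n\uparrow\infty} n \ln\left\{1 - \frac{\sigma^2 u^2}{2n} + o\left(\frac{1}{n}\right)\right\} = -\frac{1}{2} \sigma^2 u^2.$$

The result then follows by Theorem 7.1.4.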
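The convergence of characteristic functions used in the proof can be observed on a case where everything is explicit. The sketch below is an illustration under an assumption not made in the text: the $X_i$ are symmetric $\pm 1$ variables, so that $\psi(u) = \cos u$, $\sigma^2 = 1$, and the characteristic function of $S_n/\sqrt{n}$ is $\cos(u/\sqrt{n})^n$. Function names are illustrative.

```python
import math


def phi_n(u, n):
    """Characteristic function of S_n / sqrt(n) for IID symmetric +1/-1 steps."""
    return math.cos(u / math.sqrt(n)) ** n


def gaussian_cf(u, sigma2=1.0):
    """Characteristic function of a centered Gaussian variable with variance sigma2."""
    return math.exp(-sigma2 * u * u / 2.0)


for n in (10, 1000, 10**6):
    print(n, phi_n(1.5, n), gaussian_cf(1.5))
```

For fixed $u$ the gap decreases like $1/n$, which matches the $o(1/n)$ remainder in the Taylor expansion above.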


EXAMPLE 7.2.2: FAST SAMPLING OF THE POISSON DISTRIBUTION. In the case
of a Poisson distribution with mean $\theta$, the method of the inverse works as follows:
letting $p_i := e^{-\theta} \frac{\theta^i}{i!}$, sample a random variable $U$ uniformly distributed on $[0,1]$ and
set $T = k$ if $U$ falls in the interval $I_k := \left(\sum_{i=0}^{k-1} p_i, \sum_{i=0}^{k} p_i\right]$. The crude version of
this sampling algorithm consists in examining the intervals $I_k$ sequentially until
one is found that contains $U$. This would require on average $1 + E[T] = 1 + \theta$
trials. If $\theta$ is very large, a more economical procedure is available. It takes into
account the fact that the probability mass of a Poisson variable is maximal at a
value $i_0$ near the average value and decreases as one gets farther away from this
value. The exploration starts with the value $i_0$, and then proceeds to $i_0 - 1$, $i_0 + 1$,
$i_0 - 2$, $i_0 + 2$, etc. The average number of trials is then roughly equal to

$$1 + E[|T - i_0|] \approx 1 + \sqrt{\theta}\, E\left[\left|\frac{T - \theta}{\sqrt{\theta}}\right|\right].$$

By the central limit theorem, $\frac{T - \theta}{\sqrt{\theta}}$ is approximately distributed as a standard
Gaussian variable. Therefore the average number of trials for large $\theta$ is approximately

$$1 + \sqrt{\theta}\, E[|\mathcal{N}(0,1)|] = 1 + \sqrt{\frac{2\theta}{\pi}} \approx 1 + 0.80\sqrt{\theta}.$$
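The two search strategies can be sketched as follows. This is a minimal illustration, not the text's algorithm verbatim: it assumes that the cumulative probabilities are available from a precomputed table (a hypothetical `build_cdf_table` helper), and the centered search is implemented as a directional variant, in which the comparison of $U$ with the endpoints of $I_{i_0}$ tells on which side to continue, so that exactly $1 + |T - i_0|$ intervals are examined. The argument `u` is assumed to lie in $(0,1)$ and within the tabulated range.

```python
import math


def poisson_pmf(theta, i):
    """p_i = exp(-theta) theta^i / i!, computed in log-space for stability."""
    return math.exp(-theta + i * math.log(theta) - math.lgamma(i + 1))


def build_cdf_table(theta, k_max):
    """Cumulative probabilities p_0 + ... + p_k for k = 0, ..., k_max."""
    table, total = [], 0.0
    for i in range(k_max + 1):
        total += poisson_pmf(theta, i)
        table.append(total)
    return table


def in_interval(table, k, u):
    """True if u belongs to I_k = (F(k-1), F(k)]."""
    lower = table[k - 1] if k > 0 else 0.0
    return lower < u <= table[k]


def sample_sequential(table, u):
    """Crude version: examine I_0, I_1, ... in turn. Returns (value, trials)."""
    for k in range(len(table)):
        if in_interval(table, k, u):
            return k, k + 1
    raise ValueError("u lies beyond the tabulated range")


def sample_centered(table, i0, u):
    """Start at i0 and move toward u. Returns (value, trials)."""
    trials = 1
    if in_interval(table, i0, u):
        return i0, trials
    step = 1 if u > table[i0] else -1
    k = i0
    while True:
        k += step
        if k < 0 or k >= len(table):
            raise ValueError("u lies beyond the tabulated range")
        trials += 1
        if in_interval(table, k, u):
            return k, trials


theta = 100.0
table = build_cdf_table(theta, 400)
print(sample_sequential(table, 0.9), sample_centered(table, int(theta), 0.9))
```

Both functions return the same value of $T$ for a given $U$; only the number of examined intervals differs.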


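The central limit theorem admits a multidimensional version.

Theorem 7.2.3 Let $\{X_n\}_{n\geq 1}$ be a sequence of independent random vectors of
dimension $d$, and let $\{a_n\}_{n\geq 1}$ be a sequence of real numbers such that $\lim_{n\uparrow\infty} a_n = \infty$.
Suppose that

$$X_n \xrightarrow{\text{a.s.}} m$$

and

$$\sqrt{a_n}(X_n - m) \xrightarrow{\mathcal{D}} \mathcal{N}(0, \Gamma).$$

Let $g : \mathbb{R}^d \to \mathbb{R}^q$ be a function twice continuously differentiable in a neighborhood
$U$ of $m$. Then

$$g(X_n) \xrightarrow{\text{a.s.}} g(m)$$

and

$$\sqrt{a_n}(g(X_n) - g(m)) \xrightarrow{\mathcal{D}} \mathcal{N}\left(0, J_g(m)^T \Gamma J_g(m)\right),$$

where $J_g(m)$ is the Jacobian matrix of $g$ evaluated at $m$.

Proof. $U$ can be chosen convex and compact. Let $g_j$ denote the $j$-th coordinate
of $g$, and let $D^2 g_j$ denote the second differential matrix of $g_j$. By Taylor's formula,

$$g_j(x) - g_j(m) = (x - m)^T (\operatorname{grad} g_j(m)) + \frac{1}{2}(x - m)^T D^2 g_j(m^*) (x - m)$$

for some $m^*$ in the closed segment linking $m$ to $x$, denoted $[m, x]$. Therefore, if
$X_n \in U$,

$$\sqrt{a_n}(g_j(X_n) - g_j(m)) = \sqrt{a_n}(X_n - m)^T (\operatorname{grad} g_j(m)) + \frac{1}{2} \sqrt{a_n}(X_n - m)^T \frac{1}{\sqrt{a_n}} D^2 g_j(m_n^*) \sqrt{a_n}(X_n - m),$$

where $m_n^* \in [m, X_n]$.

Suppose $X_n \in U$. Since $U$ is convex and $m \in U$, also $m_n^* \in U$. Now since $U$ is
compact, the continuous function $D^2 g_j$ is bounded in $U$. Therefore, since $a_n \uparrow \infty$,
$\frac{1}{\sqrt{a_n}} D^2 g_j(m_n^*) 1_U(X_n) \to 0$. Since $X_n \xrightarrow{\text{a.s.}} m$, we deduce from the above remarks
that

$$\sqrt{a_n}(g_j(X_n) - g_j(m)) - \sqrt{a_n}(X_n - m)^T (\operatorname{grad} g_j(m)) \xrightarrow{\text{Pr.}} 0,$$

and therefore, with $J_g(m)$ the matrix whose $j$-th column is $\operatorname{grad} g_j(m)$,

$$\sqrt{a_n}(g(X_n) - g(m)) - J_g(m)^T \sqrt{a_n}(X_n - m) \xrightarrow{\text{Pr.}} 0.$$

But $\sqrt{a_n}(X_n - m) \xrightarrow{\mathcal{D}} \mathcal{N}(0, \Gamma)$ and therefore

$$\sqrt{a_n}(g(X_n) - g(m)) \xrightarrow{\mathcal{D}} J_g(m)^T \mathcal{N}(0, \Gamma) = \mathcal{N}\left(0, J_g(m)^T \Gamma J_g(m)\right).$$


Confidence Intervals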


We now briefly introduce a basic methodology of Statistics with the notion of
confidence interval.

The central limit theorem (7.2.1) implies that for $x \geq 0$,

$$\lim_{n\uparrow\infty} P\left(E[X_1] - \frac{\sigma x}{\sqrt{n}} \leq \frac{S_n}{n} \leq E[X_1] + \frac{\sigma x}{\sqrt{n}}\right) = P(|\mathcal{N}(0,1)| \leq x).$$

Under the condition $E\left[|X_1|^3\right] < \infty$, this limit is uniform in $x \in \mathbb{R}$ (we shall admit
this result, called the Berry-Esseen theorem) and therefore, with $\frac{\sigma x}{\sqrt{n}} = a$,

$$\lim_{n\uparrow\infty} \left| P\left(E[X_1] - a \leq \frac{S_n}{n} \leq E[X_1] + a\right) - P\left(|\mathcal{N}(0,1)| \leq \frac{a\sqrt{n}}{\sigma}\right) \right| = 0.$$

That is, for large $n$,

$$P\left(E[X_1] - a \leq \frac{S_n}{n} \leq E[X_1] + a\right) \approx P\left(|\mathcal{N}(0,1)| \leq \frac{a\sqrt{n}}{\sigma}\right). \qquad (7.5)$$

In other words, for large $n$, the SLLN estimate of $E[X_1]$, that is $\frac{S_n}{n}$, lies within
distance $a$ of $E[X_1]$ with probability $P\left(|\mathcal{N}(0,1)| \leq \frac{a\sqrt{n}}{\sigma}\right)$.
In statistical practice, this result is used in two manners.

(1) One wishes to know the number $n$ of experiments that guarantee that with
probability, say 0.99, the estimation error is less than $a$. Choose $n$ such that

$$P\left(|\mathcal{N}(0,1)| \leq \frac{a\sqrt{n}}{\sigma}\right) = 0.99.$$

Since

$$P(|\mathcal{N}(0,1)| \leq 2.58) = 0.99,$$

we have

$$\frac{a\sqrt{n}}{\sigma} = 2.58$$

and therefore

$$n = \left(\frac{2.58\,\sigma}{a}\right)^2.$$

(2) The (usually large) number $n$ of experiments is fixed. We want to determine
the interval $\left[\frac{S_n}{n} - a, \frac{S_n}{n} + a\right]$ within which the mean $E[X_1]$ lies with probability at
least 0.99. From (7.5):

$$a = \frac{2.58\,\sigma}{\sqrt{n}}.$$

If the standard deviation $\sigma$ is unknown, it may be either replaced by an SLLN
estimate of it (but then of course...), or the conservative method can be used,
which consists of replacing $\sigma$ by an upper bound.


EXAMPLE 7.2.4: TESTING A COIN. Consider the problem of estimating the
bias $p$ of a coin. Here, $X_n$ takes two values, 1 and 0 with probability $p$ and $1 - p$
respectively, and in particular $E[X_1] = p$, $\operatorname{Var}(X_1) = \sigma^2 = p(1 - p)$. Clearly,
since we are trying to estimate $p$, the standard deviation $\sigma$ is unknown. Here the
upper bound of $\sigma$ is the maximum of $\sqrt{p(1 - p)}$ for $p \in [0,1]$, which is attained
for $p = \frac{1}{2}$. Thus $\sigma \leq \frac{1}{2}$.

Suppose the coin was tossed 10,000 times and that the experiment produced
the estimate $\frac{S_n}{n} = 0.4925$. Can we “believe 99 percent” that the coin is unbiased?
For this we would check that the corresponding confidence interval contains the
value $\frac{1}{2}$. Using the conservative method (not a big problem since obviously the
actual bias is not far from $\frac{1}{2}$), we have

$$a = \frac{2.58\,\sigma}{\sqrt{n}} \leq \frac{2.58 \times \frac{1}{2}}{\sqrt{10000}} = 0.0129,$$

and indeed $\frac{1}{2} \in [0.4925 - 0.0129, 0.4925 + 0.0129]$, so that we are at least 99 percent
confident that the coin is unbiased.
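The two uses of the confidence interval described above reduce to two small formulas, sketched here in Python. The function names are illustrative, and the constant 2.58 is the approximate 99 percent two-sided Gaussian quantile used in the text.

```python
import math

Z_99 = 2.58  # P(|N(0,1)| <= 2.58) is approximately 0.99 (value used in the text)


def required_sample_size(a, sigma, z=Z_99):
    """Smallest integer n with z * sigma / sqrt(n) <= a, up to rounding."""
    return math.ceil((z * sigma / a) ** 2)


def half_width(n, sigma, z=Z_99):
    """Half-width a of the confidence interval for a fixed number n of experiments."""
    return z * sigma / math.sqrt(n)


def interval_contains(estimate, n, sigma, value, z=Z_99):
    """True if `value` lies in [estimate - a, estimate + a]."""
    a = half_width(n, sigma, z)
    return estimate - a <= value <= estimate + a


# Coin example: conservative bound sigma <= 1/2, n = 10,000 tosses, estimate 0.4925.
print(half_width(10_000, 0.5))
print(interval_contains(0.4925, 10_000, 0.5, 0.5))
```

With the conservative bound $\sigma \leq \frac{1}{2}$, halving the half-width $a$ multiplies the required number of tosses by four, since $n$ grows like $1/a^2$.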


7.3 Convergence in Variation 


This notion is introduced in the discrete time setting, since this will be sufficient 
for the study of convergence (in variation) of a Markov chain (see Chapter 9). 


Definition 7.3.1 Let $E$ be a countable space. The distance in variation between
two probability distributions $\alpha$ and $\beta$ on $E$ is the quantity

$$d_V(\alpha, \beta) := \frac{1}{2} \sum_{i\in E} |\alpha(i) - \beta(i)|. \qquad (7.6)$$

That $d_V$ is indeed a distance is clear.

Lemma 7.3.2 Let $\alpha$ and $\beta$ be two probability distributions on the same countable
space $E$. Then

$$d_V(\alpha, \beta) = \sup_{A\subseteq E} \{|\alpha(A) - \beta(A)|\} = \sup_{A\subseteq E} \{\alpha(A) - \beta(A)\}.$$

Proof. For the second equality observe that for each subset $A$ there is a subset $B$
such that $|\alpha(A) - \beta(A)| = \alpha(B) - \beta(B)$ (take $B = A$ or $\bar{A}$). For the first equality,
write

$$\alpha(A) - \beta(A) = \sum_{i\in E} 1_A(i)\{\alpha(i) - \beta(i)\}$$

and observe that the right-hand side is maximal for

$$A = \{i \in E;\ \alpha(i) > \beta(i)\}.$$

Therefore, with $g(i) = \alpha(i) - \beta(i)$,

$$\sup_{A\subseteq E} \{\alpha(A) - \beta(A)\} = \sum_{i\in E} g^+(i) = \frac{1}{2} \sum_{i\in E} |g(i)|$$

since $\sum_{i\in E} g(i) = 0$.
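Lemma 7.3.2 can be checked by brute force on a small state space. The sketch below is illustrative only (the two distributions are arbitrary, and distributions are represented as dictionaries, an assumption of the example): it computes $d_V$ from (7.6) and compares it with the supremum of $\alpha(A) - \beta(A)$ over all subsets $A$, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations


def variation_distance(alpha, beta):
    """d_V(alpha, beta) = (1/2) sum_i |alpha(i) - beta(i)| for dict-based distributions."""
    states = set(alpha) | set(beta)
    return Fraction(1, 2) * sum(abs(alpha.get(i, 0) - beta.get(i, 0)) for i in states)


def sup_over_subsets(alpha, beta):
    """max over all subsets A of alpha(A) - beta(A) (exponential cost, small E only)."""
    states = sorted(set(alpha) | set(beta))
    best = Fraction(0)
    for r in range(len(states) + 1):
        for subset in combinations(states, r):
            gap = sum(alpha.get(i, 0) - beta.get(i, 0) for i in subset)
            best = max(best, gap)
    return best


alpha = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
beta = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(3, 8)}
print(variation_distance(alpha, beta), sup_over_subsets(alpha, beta))
```

Here the maximizing subset is $\{i;\ \alpha(i) > \beta(i)\} = \{0, 2\}$, as in the proof.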


The distance in variation between two random variables $X$ and $Y$ with values
in $E$ is the distance in variation between their probability distributions, and it is
denoted (with a slight abuse of notation) by $d_V(X, Y)$. Therefore

$$d_V(X, Y) := \frac{1}{2} \sum_{i\in E} |P(X = i) - P(Y = i)|.$$

The distance in variation between a random variable $X$ with values in $E$ and a
probability distribution $\alpha$ on $E$, denoted (again with a slight abuse of notation) by
$d_V(X, \alpha)$, is defined by

$$d_V(X, \alpha) := \frac{1}{2} \sum_{i\in E} |P(X = i) - \alpha(i)|.$$

We now introduce the notion of coupling.

Definition 7.3.3 The coupling of two discrete probability distributions $\pi'$ on $E'$
and $\pi''$ on $E''$ consists, by definition, of the construction of a probability
distribution $\pi$ on $E := E' \times E''$ such that the marginal distributions of $\pi$ on $E'$ and $E''$
respectively are $\pi'$ and $\pi''$, that is,

$$\sum_{j\in E''} \pi(i, j) = \pi'(i) \quad \text{and} \quad \sum_{i\in E'} \pi(i, j) = \pi''(j).$$

For two probability distributions $\alpha$ and $\beta$ on the countable set $E$, let $\mathcal{D}(\alpha, \beta)$
be the collection of random vectors $(X, Y)$ taking their values in $E \times E$ and with
given marginal distributions $\alpha$ and $\beta$, that is,

$$P(X = i) = \alpha(i), \quad P(Y = i) = \beta(i). \qquad (7.7)$$

Theorem 7.3.4 For any pair $(X, Y) \in \mathcal{D}(\alpha, \beta)$, we have the fundamental
coupling inequality

$$d_V(\alpha, \beta) \leq P(X \neq Y),$$

and equality is attained by some pair $(X, Y) \in \mathcal{D}(\alpha, \beta)$, which is then said to
realize maximal coincidence.
Proof. For arbitrary $A \subset E$,

$$P(X \neq Y) \geq P(X \in A, Y \in \bar{A}) = P(X \in A) - P(X \in A, Y \in A) \geq P(X \in A) - P(Y \in A)$$

and therefore

$$P(X \neq Y) \geq \sup_{A\subset E} \{P(X \in A) - P(Y \in A)\} = d_V(\alpha, \beta).$$

We now construct $(X, Y) \in \mathcal{D}(\alpha, \beta)$ realizing equality. Let $U$, $Z$, $V$ and $W$
be independent random variables; $U$ takes its values in $\{0, 1\}$, and $Z$, $V$, $W$ take
their values in $E$. The distributions of these random variables are given by

$$P(U = 1) = 1 - d_V(\alpha, \beta),$$
$$P(Z = i) = \alpha(i) \wedge \beta(i) / (1 - d_V(\alpha, \beta)),$$
$$P(V = i) = (\alpha(i) - \beta(i))^+ / d_V(\alpha, \beta),$$
$$P(W = i) = (\beta(i) - \alpha(i))^+ / d_V(\alpha, \beta).$$

Observe that $P(V = W) = 0$. Defining

$$(X, Y) = (Z, Z) \text{ if } U = 1, \qquad (X, Y) = (V, W) \text{ if } U = 0,$$

we have

$$P(X = i) = P(U = 1, Z = i) + P(U = 0, V = i)$$
$$= P(U = 1)P(Z = i) + P(U = 0)P(V = i)$$
$$= \alpha(i) \wedge \beta(i) + (\alpha(i) - \beta(i))^+ = \alpha(i),$$

and similarly, $P(Y = i) = \beta(i)$. Therefore, $(X, Y) \in \mathcal{D}(\alpha, \beta)$. Also, $P(X = Y) =
P(U = 1) = 1 - d_V(\alpha, \beta)$.
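The construction in the proof translates into an explicit joint distribution: $P(X = i, Y = j) = \alpha(i) \wedge \beta(i)\, 1_{\{i=j\}} + (\alpha(i) - \beta(i))^+ (\beta(j) - \alpha(j))^+ / d_V(\alpha, \beta)$. The following sketch (illustrative names, dictionary-based distributions and exact rational arithmetic are assumptions of the example) builds this joint law and exposes its marginals and the probability of non-coincidence.

```python
from fractions import Fraction


def maximal_coupling(alpha, beta):
    """Joint law of (X, Y) built as in the proof, with d_V(alpha, beta)."""
    states = sorted(set(alpha) | set(beta))
    a = {i: Fraction(alpha.get(i, 0)) for i in states}
    b = {i: Fraction(beta.get(i, 0)) for i in states}
    d = Fraction(1, 2) * sum(abs(a[i] - b[i]) for i in states)
    joint = {}
    for i in states:
        for j in states:
            # Diagonal part: (X, Y) = (Z, Z) on the event U = 1.
            mass = min(a[i], b[i]) if i == j else Fraction(0)
            # Off-diagonal part: (X, Y) = (V, W) on the event U = 0.
            if d > 0:
                mass += max(a[i] - b[i], 0) * max(b[j] - a[j], 0) / d
            if mass:
                joint[(i, j)] = mass
    return joint, d


def marginals(joint):
    """Marginal distributions of X and Y under the joint law."""
    left, right = {}, {}
    for (i, j), mass in joint.items():
        left[i] = left.get(i, 0) + mass
        right[j] = right.get(j, 0) + mass
    return left, right


alpha = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
beta = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(3, 8)}
joint, d = maximal_coupling(alpha, beta)
print(d, sum(m for (i, j), m in joint.items() if i != j))
```

The probability of the event $\{X \neq Y\}$ under this joint law equals $d_V(\alpha, \beta)$, so the coupling inequality is attained.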


EXAMPLE 7.3.5: POISSON'S LAW OF RARE EVENTS, TAKE 2. Let $Y_1, \ldots, Y_n$ be
independent random variables taking their values in $\{0, 1\}$, with $P(Y_i = 1) = \pi_i$,
$1 \leq i \leq n$. Let $X := \sum_{i=1}^{n} Y_i$ and $\lambda := \sum_{i=1}^{n} \pi_i$. Let $p_\lambda$ be the Poisson distribution
with mean $\lambda$. We wish to bound the variation distance between the distribution $q$
of $X$ and $p_\lambda$. For this we construct a coupling of the two distributions as follows.
First we generate independent couples $(Y_1, Y_1'), \ldots, (Y_n, Y_n')$ such that

$$P(Y_i = j, Y_i' = k) = \begin{cases} 1 - \pi_i & \text{if } j = 0, k = 0, \\ e^{-\pi_i} \frac{\pi_i^k}{k!} & \text{if } j = 1, k \geq 1, \\ e^{-\pi_i} - (1 - \pi_i) & \text{if } j = 1, k = 0. \end{cases}$$

One verifies that for all $1 \leq i \leq n$, $P(Y_i = 1) = \pi_i$ and $Y_i' \sim \text{Poi}(\pi_i)$. In particular
$X' := \sum_{i=1}^{n} Y_i'$ is a Poisson variable with mean $\lambda$. Now

$$P(X \neq X') = P\left(\sum_{i=1}^{n} Y_i \neq \sum_{i=1}^{n} Y_i'\right) \leq P(Y_i \neq Y_i' \text{ for some } i) \leq \sum_{i=1}^{n} P(Y_i \neq Y_i').$$

But

$$P(Y_i \neq Y_i') = e^{-\pi_i} - (1 - \pi_i) + P(Y_i' > 1) = \pi_i(1 - e^{-\pi_i}) \leq \pi_i^2.$$

Therefore $P(X \neq X') \leq \sum_{i=1}^{n} \pi_i^2$ and by the coupling inequality

$$d_V(q, p_\lambda) \leq \sum_{i=1}^{n} \pi_i^2.$$

For instance, with $\pi_i = p := \frac{\lambda}{n}$ we have

$$d_V(q, p_\lambda) \leq \frac{\lambda^2}{n}.$$

In other terms the binomial distribution of size $n$ and mean $\lambda$ differs in variation
by less than $\frac{\lambda^2}{n}$ from a Poisson variable with the same mean. This is obviously a
refinement of the Poisson approximation theorem since it gives exploitable estimates
for finite $n$.

Definition 7.3.6 A sequence $\{X_n\}_{n\geq 1}$ of discrete random variables with values
in $E$ is said to converge in distribution to the probability distribution $\pi$ on $E$ if
for all $i \in E$, $\lim_{n\uparrow\infty} P(X_n = i) = \pi(i)$. It is said to converge in variation to this
distribution if

$$\lim_{n\uparrow\infty} \sum_{i\in E} |P(X_n = i) - \pi(i)| = 0. \qquad (7.8)$$
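Convergence in variation in the sense of (7.8) can be observed on the binomial and Poisson distributions of Example 7.3.5. The sketch below (illustrative function names; probabilities are computed in log-space with `math.lgamma`, a numerical choice not taken from the text, and $\lambda < n$ is assumed) evaluates the variation distance exactly up to rounding and compares it with the bound $\lambda^2/n$.

```python
import math


def binomial_pmf(n, p, k):
    """P(Bin(n, p) = k) for 0 < p < 1, computed in log-space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def poisson_pmf(lam, k):
    """P(Poi(lam) = k), computed in log-space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def variation_distance_binomial_poisson(n, lam):
    """d_V between Bin(n, lam/n) and Poi(lam), assuming lam < n."""
    p = lam / n
    total, poisson_mass = 0.0, 0.0
    for k in range(n + 1):
        q = poisson_pmf(lam, k)
        total += abs(binomial_pmf(n, p, k) - q)
        poisson_mass += q
    # The binomial distribution puts no mass above n; add the Poisson tail there.
    total += max(0.0, 1.0 - poisson_mass)
    return 0.5 * total


lam = 2.0
for n in (10, 100, 1000):
    print(n, variation_distance_binomial_poisson(n, lam), lam ** 2 / n)
```

The computed distances decrease with $n$ and stay below the coupling bound, which is therefore not sharp but has the right order in $1/n$.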


Observe that Definition 7.3.6 concerns only the marginal distributions of the
stochastic process, not the stochastic process itself. Therefore, if there exists
another stochastic process $\{X_n'\}_{n\geq 0}$ such that $X_n \stackrel{\mathcal{D}}{\sim} X_n'$ for all $n \geq 0$, and if there
exists a third one $\{X_n''\}_{n\geq 0}$ such that $X_n'' \stackrel{\mathcal{D}}{\sim} \pi$ for all $n \geq 0$, then (7.8) follows from

$$\lim_{n\uparrow\infty} d_V(X_n', X_n'') = 0. \qquad (7.9)$$

This trivial observation is useful because of the resulting freedom in the choice of
$\{X_n'\}$ and $\{X_n''\}$. An interesting situation occurs when there exists a finite random
time $\tau$ such that $X_n' = X_n''$ for all $n \geq \tau$.


Definition 7.3.7 Two stochastic processes $\{X_n'\}_{n\geq 0}$ and $\{X_n''\}_{n\geq 0}$ taking their
values in the same state space $E$ are said to couple if there exists an almost surely
finite random time $\tau$ such that

$$n \geq \tau \Rightarrow X_n' = X_n''. \qquad (7.10)$$

The random variable $\tau$ is called a coupling time of the two processes.

Theorem 7.3.8 For any coupling time $\tau$ of $\{X_n'\}_{n\geq 0}$ and $\{X_n''\}_{n\geq 0}$, we have the
coupling inequality

$$d_V(X_n', X_n'') \leq P(\tau > n). \qquad (7.11)$$

Proof. For all $A \subset E$,

$$\begin{aligned}
P(X_n' \in A) - P(X_n'' \in A) &= P(X_n' \in A, \tau \leq n) + P(X_n' \in A, \tau > n) \\
&\quad - P(X_n'' \in A, \tau \leq n) - P(X_n'' \in A, \tau > n) \\
&= P(X_n' \in A, \tau > n) - P(X_n'' \in A, \tau > n) \\
&\leq P(X_n' \in A, \tau > n) \leq P(\tau > n).
\end{aligned}$$

Inequality (7.11) then follows from Lemma 7.3.2.

Therefore, if the coupling time is $P$-a.s. finite, that is $\lim_{n\uparrow\infty} P(\tau > n) = 0$,

$$\lim_{n\uparrow\infty} d_V(X_n, \pi) = \lim_{n\uparrow\infty} d_V(X_n', X_n'') = 0.$$

Definition 7.3.9 (A) A sequence $\{\alpha_n\}_{n\geq 0}$ of probability distributions on $E$ is said
to converge in variation to the probability distribution $\beta$ on $E$ if

$$\lim_{n\uparrow\infty} d_V(\alpha_n, \beta) = 0.$$

(B) An $E$-valued random sequence $\{X_n\}_{n\geq 0}$ such that for some probability
distribution $\pi$ on $E$,

$$\lim_{n\uparrow\infty} d_V(X_n, \pi) = 0, \qquad (7.12)$$

is said to converge in variation to $\pi$.


7.4 The Rank of Convergence in Distribution 


Convergence in distribution is weaker than almost-sure convergence. This means 
the following. 


Theorem 7.4.1 If the sequence $\{X_n\}_{n\geq 1}$ of random vectors of $\mathbb{R}^d$ converges
almost surely to some random vector $X$, it also converges in distribution to the same
vector $X$.

Proof. By dominated convergence, for all $u \in \mathbb{R}^d$,

$$\lim_{n\uparrow\infty} E\left[e^{i\langle u, X_n\rangle}\right] = E\left[e^{i\langle u, X\rangle}\right],$$

which implies, by Paul Lévy's theorem (Theorem 7.1.4), that $\{X_n\}_{n\geq 1}$ converges
in distribution to $X$.

In fact, convergence in distribution is even weaker than convergence in
probability.

Theorem 7.4.2 If the sequence $\{X_n\}_{n\geq 1}$ of random variables converges in
probability to some random variable $X$, it also converges in distribution to $X$.

Proof. If this were not the case, one could find a bounded continuous function $f$
such that $E[f(X_n)]$ does not converge to $E[f(X)]$. In particular, there would exist
a subsequence $n_k$ and some $\varepsilon > 0$ such that $|E[f(X_{n_k})] - E[f(X)]| \geq \varepsilon$ for all $k$.
As $\{X_{n_k}\}_{k\geq 1}$ converges in probability to $X$, one can extract from it a subsequence
$\{X_{n_{k_\ell}}\}_{\ell\geq 1}$ converging almost surely to $X$. In particular, since $f$ is bounded and
continuous, $\lim_{\ell\uparrow\infty} E[f(X_{n_{k_\ell}})] = E[f(X)]$ by dominated convergence, a contradiction.

Combining Theorems 6.4.7 and 7.4.2, we have that convergence in distribution
is weaker than convergence in the quadratic mean:

Theorem 7.4.3 If the sequence of real random variables $\{Z_n\}_{n\geq 1}$ converges in
the quadratic mean to some random variable $Z$, it also converges in distribution
to the same random variable $Z$.

Convergence in distribution is weaker than convergence in variation:

Theorem 7.4.4 If the sequence of real random variables $\{X_n\}_{n\geq 1}$ converges in
variation to $X$, it converges in distribution to the same random variable.

Proof. Indeed, for all $x$ (not just the continuity points of the distribution of $X$),

$$|P(X_n \leq x) - P(X \leq x)| \leq d_V(X_n, X) \to 0.$$


A Stability Property of the Gaussian Distribution 


Theorem 7.4.5 Let $\{Z_n\}_{n\geq 1}$, where $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(m)})$, be a sequence of
Gaussian random vectors of fixed dimension $m$ that converges componentwise in
the quadratic mean to some vector $Z = (Z^{(1)}, \ldots, Z^{(m)})$. Then the latter vector is
Gaussian.

Proof. In fact, by continuity of the inner product in $L^2_{\mathbb{R}}(P)$, for all $1 \leq i, j \leq m$,
$\lim_{n\uparrow\infty} E[Z_n^{(i)} Z_n^{(j)}] = E[Z^{(i)} Z^{(j)}]$ and $\lim_{n\uparrow\infty} E[Z_n^{(i)}] = E[Z^{(i)}]$, that is

$$\lim_{n\uparrow\infty} m_{Z_n} = m_Z, \qquad \lim_{n\uparrow\infty} \Gamma_{Z_n} = \Gamma_Z$$

and in particular, for all $u \in \mathbb{R}^m$,

$$\lim_{n\uparrow\infty} E\left[e^{iu^T Z_n}\right] = \lim_{n\uparrow\infty} e^{iu^T m_{Z_n} - \frac{1}{2} u^T \Gamma_{Z_n} u} = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u}.$$

The sequence $\{u^T Z_n\}_{n\geq 1}$ converges in the quadratic mean to $u^T Z$, and therefore
it also converges in distribution to $u^T Z$. Therefore, $\lim_{n\uparrow\infty} E\left[e^{iu^T Z_n}\right] = E\left[e^{iu^T Z}\right]$
and finally

$$E\left[e^{iu^T Z}\right] = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u}$$

for all $u \in \mathbb{R}^m$. This shows that $Z$ is a Gaussian vector.

Therefore, limits in the quadratic mean preserve the Gaussian nature of random
vectors. This is the stability property referred to in the title of this subsection.
Note that the Gaussian nature of random vectors is also preserved by linear
transformations, as we already know.


Skorokhod's Theorem


There is an at first sight surprising connection between convergence in distribution
and almost-sure convergence, exemplified in the following example and generalized
by the theorem following it.

EXAMPLE 7.4.6: CONVERGENCE IN DISTRIBUTION OF EXPONENTIAL RANDOM
VARIABLES. Consider the sequence of exponential CDFs

$$F_n(x) = (1 - e^{-\lambda_n x}) 1_{\{x\geq 0\}} \quad (n \geq 1),$$

where $\{\lambda_n\}_{n\geq 1}$ is a sequence of positive numbers converging to the finite positive
number $\lambda$. Obviously $F_n \xrightarrow{\mathcal{D}} F$ where $F(x) = (1 - e^{-\lambda x}) 1_{\{x\geq 0\}}$. Let now $X$ be an
exponential random variable with parameter $\lambda$. The random variables

$$X_n := \frac{\lambda}{\lambda_n} X \quad (n \geq 1)$$

have the respective CDF $F_n$ ($n \geq 1$) and obviously $X_n \xrightarrow{\text{a.s.}} X$.

The next result, Skorokhod's theorem, generalizes the previous example.

Theorem 7.4.7 Let $\{F_n\}_{n\geq 1}$ be the CDFs of a sequence $\{X_n\}_{n\geq 1}$ of random
variables converging in distribution to a random variable $X$ with the CDF $F$. There
exists a sequence $\{Y_n\}_{n\geq 1}$ of random variables with the CDFs $\{F_n\}_{n\geq 1}$ that converges
almost surely to a random variable $Y$ with the CDF $F$.

Proof. Let $F_n^{\leftarrow}$ and $F^{\leftarrow}$ denote the generalized inverses of $F_n$ and $F$ respectively
(see Theorem 3.2.30). The sequence we are looking for will be defined on the
probability space $(\Omega, \mathcal{F}, P) := ((0,1), \mathcal{B}((0,1)), \ell)$, where $\ell$ is the Lebesgue measure. It
is defined by $Y_n(\omega) := F_n^{\leftarrow}(\omega)$, that is²

$$Y_n(u) := F_n^{\leftarrow}(u) \quad (u \in (0,1)),$$

and similarly $Y(u) := F^{\leftarrow}(u)$. In order to prove that $Y_n \xrightarrow{\text{a.s.}} Y$, it suffices to show that $F_n^{\leftarrow}(u) \to F^{\leftarrow}(u)$ for
all $u \in (0,1)$ where $F^{\leftarrow}$ is continuous, since the complement of such points is of
null Lebesgue measure.

Let $u$ be such a point. Let $C$ be the set of points of continuity of $F$. If $a, b \in C$,
$a < b$, are such that

$$a < F^{\leftarrow}(u) < b, \qquad (\star)$$

then there exists a $v$, $u < v < 1$, such that $a < F^{\leftarrow}(u) \leq F^{\leftarrow}(v) < b$, and in particular

$$F(a) < u < v \leq F(b).$$

Since $a, b \in C$, for large enough $n$, $F_n(a) < u < F_n(b)$, that is,

$$a < F_n^{\leftarrow}(u) \leq b. \qquad (\star\star)$$

The conclusion then follows from $(\star)$ and $(\star\star)$.

² With a change of notation that will maybe avoid confusion.

7.5 Exercises 


Exercise 7.5.1. AUTOREGRESSIVE GAUSSIAN MODEL, TAKE 2 
This is a continuation of Exercise 3.6.32. 


3. Show that $X_n$ converges in distribution to a Gaussian variable of mean
0 and variance $\gamma^2$ to be computed.


4. Suppose now that $X_0$ is Gaussian with mean 0 and variance $\gamma^2$ as computed in
the previous question. Show that $\{X_n\}_{n\ge 0}$ is a strictly stationary sequence, in the
sense that for all $n$, $(X_k, X_{k+1}, \ldots, X_{k+n})$ has a distribution independent of $k$.


Exercise 7.5.2. POISSON’S LAW OF RARE EVENTS IN THE PLANE 

With $A$ a positive real number, let $Z_1, \ldots, Z_M$ be IID random vectors uniformly
distributed in the square $\Gamma_A := [0, A] \times [0, A]$. Define for any set $C \subseteq \Gamma_A$, $N(C)$
to be the number of random vectors $Z_i$ that fall in $C$. Let $C_1, \ldots, C_K$ be disjoint
bounded subsets of $\mathbb{R}^2$.

Let $M = M(A)$ be a function of $A$ such that

$$\frac{M(A)}{A^2} = \lambda > 0.$$

Show that, as $A \uparrow \infty$, $(N(C_1), \ldots, N(C_K))$ converges in distribution. Identify
the limit distribution.


Exercise 7.5.3. A CHARACTERISTIC PROPERTY OF THE GAUSSIAN DISTRIBU- 
TION 
Let $G$ be a cumulative distribution function on $\mathbb{R}$ such that

$$\int_{\mathbb{R}} x \, dG(x) = 0 \quad \text{and} \quad \int_{\mathbb{R}} x^2 \, dG(x) = 1.$$

In addition, suppose that $G$ has the following property: If $X_1$ and $X_2$ are independent
random variables with the CDF $G$, then $\frac{X_1 + X_2}{\sqrt{2}}$ also admits $G$ as CDF. Prove
that $G$ is the CDF of a Gaussian variable with mean 0 and variance 1.


Exercise 7.5.4. MIXED MOMENTS OF A GAUSSIAN VECTOR 

Let $X = (X_1, \ldots, X_n)^T$ be a centered (0-mean) $n$-dimensional Gaussian vector
with the covariance matrix $\Gamma = \{\sigma_{ij}\}$. Prove the following
formula:

$$E[X_{i_1} X_{i_2} \cdots X_{i_{2k}}] = \sum_{\substack{(j_1, \ldots, j_{2k}) \\ j_1 < j_2, \ldots, j_{2k-1} < j_{2k}}} \sigma_{j_1 j_2}\, \sigma_{j_3 j_4} \cdots \sigma_{j_{2k-1} j_{2k}}, \tag{7.13}$$

where the summation extends over all permutations $(j_1, \ldots, j_{2k})$ of $\{i_1, \ldots, i_{2k}\}$
such that $j_1 < j_2, \ldots, j_{2k-1} < j_{2k}$. There are $1 \cdot 3 \cdot 5 \cdots (2k-1)$ terms in the right-hand
side of Eq. (7.13). The indices $i_1, \ldots, i_{2k}$ are in $\{1, \ldots, n\}$ and they may
occur with repetitions. Also prove that the odd moments of a centered Gaussian
vector are null, that is:

$$E[X_{i_1} \cdots X_{i_{2k+1}}] = 0,$$

for all $(i_1, \ldots, i_{2k+1}) \in \{1, 2, \ldots, n\}^{2k+1}$. Apply the above to compute the quantities
$E[X_1 X_2 X_3 X_4]$, $E[X_1^2 X_2^2]$, $E[X_1^4]$ and $E[X_1^{2k}]$.


Exercise 7.5.5. SERIES SUMMATION VIA THE CENTRAL LIMIT THEOREM 
Prove, using the central limit theorem, that

$$\lim_{n \uparrow \infty} e^{-n} \sum_{k=0}^{n} \frac{n^k}{k!} = \frac{1}{2}.$$


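Exercise 7.5.6. $g(X_n) \xrightarrow{\mathcal{D}} g(X)$
Let $\{X_n\}_{n\ge 1}$ and $X$ be random variables such that $X_n \xrightarrow{\mathcal{D}} X$, and let $g : \mathbb{R} \to \mathbb{R}$
be a continuous function. Prove that $g(X_n) \xrightarrow{\mathcal{D}} g(X)$.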


Exercise 7.5.7. CAUCHY TRICKS 

Let $\{X_n\}_{n\ge 1}$ be a sequence of IID Cauchy random variables.
(a) What is the limit in distribution of $\frac{X_1 + \cdots + X_n}{n}$?

(b) Does $\frac{X_1 + \cdots + X_n}{n^2}$ converge in distribution?

(c) Does $\frac{X_1 + \cdots + X_n}{n}$ converge almost surely to a (nonrandom) constant?
Exercise 7.5.8. CONVERGENCE IN DISTRIBUTION, BUT NOT IN PROBABILITY
Let $Z$ be a random variable with a symmetric distribution (that is, $Z$ and $-Z$
have the same distribution). Define the sequence $\{Z_n\}_{n\ge 1}$ as follows: $Z_n = Z$ if $n$
is odd, $Z_n = -Z$ if $n$ is even. In particular, $\{Z_n\}_{n\ge 1}$ converges in distribution to
$Z$. Show that if $Z$ is not the constant 0, then $\{Z_n\}_{n\ge 1}$ does NOT converge to $Z$ in
probability.


Exercise 7.5.9. CONVERGENCE IN PROBABILITY AND CONVERGENCE IN VARI- 
ATION 

Let $\{Z_n\}_{n\ge 0}$ be a sequence of $\{0,1\}$-valued random variables. Show that it converges
in variation to 0 if and only if it converges in probability to 0.


Exercise 7.5.10. CONVERGENCE IN PROBABILITY BUT NOT IN DISTRIBUTION 
Give an example of a sequence of random variables that converges in probability 
but not in distribution. 




Chapter 8 


Martingales 


For the general public, a martingale is a clever way of gambling. In mathematics,
it formalizes the notion of fair game, and we shall see that martingale theory
indeed has something to say about such games. However, the interest and scope
of martingale theory extend far beyond gambling, and it has become a fundamental
tool of the theory of stochastic processes. The present chapter is an introduction to
this topic, featuring the two main pillars on which it rests: the optional sampling
theorem and the convergence theory of martingales.


8.1 The Martingale Property 


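Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\mathcal{F}_n\}_{n\ge 0}$ be a history (or filtration)
defined on it, that is, a sequence of sub-$\sigma$-fields of $\mathcal{F}$ that is non-decreasing: $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$
$(n \ge 0)$. The internal history of a random sequence $\{X_n\}_{n\ge 0}$ is the filtration
$\{\mathcal{F}_n^X\}_{n\ge 0}$ defined by $\mathcal{F}_n^X := \sigma(X_0, \ldots, X_n)$.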


Definition 8.1.1 A complex random sequence $\{Y_n\}_{n\ge 0}$ such that for all $n \ge 0$
(i) $Y_n$ is $\mathcal{F}_n$-measurable and
(ii) $E[|Y_n|] < \infty$

is called a $(P, \mathcal{F}_n)$-martingale (resp., submartingale, supermartingale) if, in addition,
for all $n \ge 0$, $P$-almost surely,

$$E[Y_{n+1} \mid \mathcal{F}_n] = Y_n \quad (\text{resp., } \ge Y_n,\ \le Y_n). \tag{8.1}$$

When the context is clear as to the choice of the underlying probability measure
$P$, we shall abbreviate, saying for instance, “$\mathcal{F}_n$-submartingale” instead of
“$(P, \mathcal{F}_n)$-submartingale”.

If the history is not mentioned, it is assumed to be the internal history. For
instance, the phrase “$\{Y_n\}_{n\ge 0}$ is a martingale” means that it is an $\mathcal{F}_n^Y$-martingale.

Of course an $\mathcal{F}_n$-martingale is an $\mathcal{F}_n$-submartingale and an $\mathcal{F}_n$-supermartingale.
Condition (8.1) implies that for all $k \ge 1$, all $n \ge 0$,

$$E[Y_{n+k} \mid \mathcal{F}_n] = Y_n \quad (\text{resp., } \ge Y_n,\ \le Y_n).$$


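Proof. In the martingale case, for instance, by the rule of successive conditioning

$$E[Y_{n+k} \mid \mathcal{F}_n] = E\big[E[Y_{n+k} \mid \mathcal{F}_{n+k-1}] \mid \mathcal{F}_n\big] = E[Y_{n+k-1} \mid \mathcal{F}_n] = E[Y_{n+k-2} \mid \mathcal{F}_n] = \cdots = E[Y_{n+1} \mid \mathcal{F}_n] = Y_n.$$

In particular, taking expectations and letting $n = 0$,

$$E[Y_k] = E[Y_0] \quad (\text{resp., } \ge E[Y_0],\ \le E[Y_0]).$$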


EXAMPLE 8.1.2: SUMS OF IID RANDOM VARIABLES. Let $\{X_n\}_{n\ge 0}$ be an IID
sequence of centered and integrable random variables. The random sequence

$$Y_n = X_0 + X_1 + \cdots + X_n \quad (n \ge 0)$$

is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and

$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1} \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1}] = Y_n,$$

where the second equality is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent
(Theorem 5.6.5).


EXAMPLE 8.1.3: PRODUCTS OF IIDS. Let $X = \{X_n\}_{n\ge 0}$ be an IID sequence of
integrable random variables with mean 1. The random sequence

$$Y_n = \prod_{k=0}^{n} X_k \quad (n \ge 0)$$

is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and

$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E\Big[X_{n+1} \prod_{k=0}^{n} X_k \,\Big|\, \mathcal{F}_n^X\Big] = E[X_{n+1} \mid \mathcal{F}_n^X] \prod_{k=0}^{n} X_k = E[X_{n+1}] \prod_{k=0}^{n} X_k = Y_n,$$

where the second equality uses the $\mathcal{F}_n^X$-measurability of the product, and the third is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent
(Theorem 5.6.5).


EXAMPLE 8.1.4: GAMBLING. Consider the random sequence $\{Y_n\}_{n\ge 0}$ with values
in $\mathbb{R}_+$, defined by $Y_0 = a \in \mathbb{R}_+$ and

$$Y_{n+1} = Y_n + X_{n+1}\, b_{n+1}(X_0^n) \quad (n \ge 0),$$

where $X_0^n := (X_0, \ldots, X_n)$, $X_0 = Y_0$, $\{X_n\}_{n\ge 1}$ is an IID sequence of random
variables taking the values $+1$ or $-1$ with equal probability, and the family of
functions $b_n : \{0,1\}^n \to \mathbb{N}$ $(n \ge 1)$ is the betting strategy, that is, $b_{n+1}(X_0^n)$
is the stake at time $n+1$ of a gambler given the observed history $\mathcal{F}_n^X$ of the
chance outcomes up to time $n$. Admissible bets must guarantee that the fortune
$Y_n$ remains non-negative at all times $n$, that is, $b_{n+1}(X_0^n) \le Y_n$. The process so
defined is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and

$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1}\, b_{n+1}(X_0^n) \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1} \mid \mathcal{F}_n^X]\, b_{n+1}(X_0^n) = Y_n,$$

where the second equality uses Theorem 5.6.9. The integrability condition should
be checked on each application. It is satisfied if the stakes $b_n(X_0^{n-1})$ are uniformly
bounded.
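
The martingale property of the gambler's fortune implies $E[Y_n] = a$ for every $n$, whatever the admissible strategy. The following sketch checks this by exact enumeration of all outcome sequences. The strategy `stake` (bet half of the current fortune, rounded down) is a hypothetical choice made only for illustration; it is admissible because the stake never exceeds the fortune.

```python
from fractions import Fraction
from itertools import product

def stake(fortune):
    """Hypothetical betting rule: half the current fortune, rounded down."""
    return fortune // 2

def expected_fortune(a, n):
    """Exact E[Y_n], averaging over the 2**n equally likely sign sequences."""
    total = Fraction(0)
    for signs in product((1, -1), repeat=n):
        y = a
        for s in signs:
            y += s * stake(y)      # Y_{k+1} = Y_k + X_{k+1} * (stake at time k+1)
        total += Fraction(y, 2 ** n)
    return total

print(expected_fortune(10, 6))
```

Replacing `stake` by any other rule that depends only on the past and never exceeds the fortune should leave the expectation unchanged.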


EXAMPLE 8.1.5: HARMONIC FUNCTIONS OF AN HMC. Let $\{X_n\}_{n\ge 0}$ be an HMC
with countable state space $E$ and transition matrix $\mathbf{P}$. A function $h : E \to \mathbb{R}$ is called
harmonic (resp., subharmonic, superharmonic) if $\mathbf{P}h$ is well defined and

$$\mathbf{P}h = h \quad (\text{resp., } \ge h,\ \le h), \tag{8.2}$$

that is,

$$\sum_{j \in E} p_{ij}\, h(j) = h(i) \quad (\text{resp., } \ge h(i),\ \le h(i)) \quad (i \in E).$$

Superharmonic functions are also called excessive functions.

Equation (8.2) is equivalent, in the harmonic case for instance, to

$$E[h(X_{n+1}) \mid X_n = i] = h(i) \quad (i \in E). \tag{$\star$}$$

In view of the Markov property, the left-hand side of the above equality is also
equal to

$$E[h(X_{n+1}) \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0],$$

and therefore $(\star)$ is equivalent to

$$E[h(X_{n+1}) \mid \mathcal{F}_n^X] = h(X_n).$$

Therefore, if $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-martingale.
Similarly, for a subharmonic (resp., superharmonic) function $h$ such
that $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-submartingale
(resp., $\mathcal{F}_n^X$-supermartingale).


Definition 8.1.6 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. A $(P, \mathcal{F}_n)$-martingale difference
(resp., submartingale difference, supermartingale difference) is, by definition, a
complex random sequence $\{X_n\}_{n\ge 0}$ such that for all $n \ge 0$,

(a) $X_n$ is $\mathcal{F}_n$-measurable,
(b) $E[|X_n|] < \infty$ and $E[X_n] = 0$, and
(c) $E[X_{n+1} \mid \mathcal{F}_n] = 0$ (resp., $\ge 0$, $\le 0$).

The notion of martingale difference generalizes that of centered IID sequences.
Indeed, for such IID sequences, $X_{n+1}$ is independent of $\mathcal{F}_n^X$, and therefore (Theorem
5.6.5) $E[X_{n+1} \mid \mathcal{F}_n^X] = 0$.

Convex Functions of Martingales 


Theorem 8.1.7 Let $I \subseteq \mathbb{R}$ be an interval (closed, open, semi-closed, infinite,
etc.) and let $\varphi : I \to \mathbb{R}$ be a convex function.

A. Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale such that $P(Y_n \in I) = 1$ for all $n \ge 0$.
Assume that $E[|\varphi(Y_n)|] < \infty$ for all $n \ge 0$. Then, the process $\{\varphi(Y_n)\}_{n\ge 0}$ is
an $\mathcal{F}_n$-submartingale.

B. Assume moreover that $\varphi$ is non-decreasing and suppose this time that
$\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale. Then, the process $\{\varphi(Y_n)\}_{n\ge 0}$ is an
$\mathcal{F}_n$-submartingale.

Proof. By Jensen’s inequality for conditional expectations (Exercise 8.6.1),

$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]).$$

Therefore (case A)

$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]) = \varphi(Y_n),$$

and (case B)

$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]) \ge \varphi(Y_n).$$

(For the last inequality, use the submartingale property $E[Y_{n+1} \mid \mathcal{F}_n] \ge Y_n$ and the
hypothesis that $\varphi$ is non-decreasing.)


EXAMPLE 8.1.8: Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale and let $p \ge 1$. As a special
case of Theorem 8.1.7 with the convex function $x \mapsto |x|^p$, we have that if $E[|Y_n|^p] < \infty$
for all $n \ge 0$, $\{|Y_n|^p\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale. Applying Theorem 8.1.7 with the convex
function $x \mapsto x^+$, we have that $\{Y_n^+\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale.


Martingale Transforms and Stopped Martingales 


Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. The complex stochastic process $\{H_n\}_{n\ge 1}$ is called
$\mathcal{F}_n$-predictable if
$H_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$.

Let $\{Y_n\}_{n\ge 0}$ be another complex stochastic process. The stochastic process

$$(H \circ Y)_n := \sum_{k=1}^{n} H_k (Y_k - Y_{k-1}) \quad (n \ge 1)$$

is called the transform of $Y$ by $H$.

Theorem 8.1.9

(a) Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale and let $\{H_n\}_{n\ge 1}$ be a bounded non-negative
$\mathcal{F}_n$-predictable process. Then $\{(H \circ Y)_n\}_{n\ge 1}$ is an $\mathcal{F}_n$-submartingale.

(b) If $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale and if $\{H_n\}_{n\ge 1}$ is bounded and $\mathcal{F}_n$-predictable,
then $\{(H \circ Y)_n\}_{n\ge 1}$ is an $\mathcal{F}_n$-martingale.

Proof. Conditions (i) and (ii) of Definition 8.1.1 are obviously satisfied. Moreover,

(a) $E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = E[H_{n+1}(Y_{n+1} - Y_n) \mid \mathcal{F}_n] = H_{n+1}\, E[Y_{n+1} - Y_n \mid \mathcal{F}_n] \ge 0$,

using Theorem 5.6.9 for the second equality.

(b) $E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = H_{n+1}\, E[Y_{n+1} - Y_n \mid \mathcal{F}_n] = 0$,

by the same token.




Definition 8.1.10 Let $\{\mathcal{F}_n\}_{n\in\mathbb{N}}$ be a non-decreasing sequence of sub-$\sigma$-fields of
$\mathcal{F}$. A random variable $\tau$ taking its values in $\mathbb{N}$ and such that, for all $m \in \mathbb{N}$, the
event $\{\tau = m\}$ is in $\mathcal{F}_m$ is called an $\mathcal{F}_n$-stopping time.

In particular, $\tau$ is an $\mathcal{F}_n^X$-stopping time if, for all $m \in \mathbb{N}$, the event $\{\tau = m\}$
can be expressed as

$$1_{\{\tau = m\}} = \psi_m(X_0, \ldots, X_m),$$

for some measurable function $\psi_m$ with values in $\{0,1\}$ (Theorem 5.6.3). This
explains why stopping times are said to be non-anticipative.

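Theorem 8.1.11 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a history and let $\mathcal{F}_\infty := \sigma(\cup_{n\ge 0} \mathcal{F}_n)$. Let $\tau$ be
an $\mathcal{F}_n$-stopping time. The collection of events

$$\mathcal{F}_\tau := \{A \in \mathcal{F}_\infty \mid A \cap \{\tau = m\} \in \mathcal{F}_m \text{ for all } m \ge 0\}$$

is a $\sigma$-field, and $\tau$ is $\mathcal{F}_\tau$-measurable. Let $\{X_n\}_{n\ge 0}$ be an $E$-valued $\mathcal{F}_n$-adapted random
sequence, and let $\tau$ be a finite $\mathcal{F}_n$-stopping time. Then $X_\tau$ is $\mathcal{F}_\tau$-measurable.


The proof is left as an exercise.


If $\{\mathcal{F}_n\}_{n\ge 0}$ is the internal history of some random sequence $\{X_n\}_{n\ge 0}$, that is,
if $\mathcal{F}_n = \mathcal{F}_n^X$ $(n \ge 0)$, one may interpret $\mathcal{F}_\tau^X$ as the collection of events that are
determined by the observation of the random sequence up to time $\tau$ (included).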
Theorem 8.1.9 immediately leads to the stopped martingale theorem: 


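Theorem 8.1.12 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and
let $\tau$ be an $\mathcal{F}_n$-stopping time. Then $\{Y_{n \wedge \tau}\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale (resp.,
martingale). In particular,

$$E[Y_{n \wedge \tau}] \ge E[Y_0] \quad (\text{resp., } = E[Y_0]) \quad (n \ge 0). \tag{8.3}$$

Proof. Let $H_n := 1_{\{n \le \tau\}}$. The stochastic process $H$ is $\mathcal{F}_n$-predictable since
$\{H_n = 0\} = \{\tau \le n-1\} \in \mathcal{F}_{n-1}$. We have

$$Y_{n \wedge \tau} = Y_0 + \sum_{k=1}^{n \wedge \tau} (Y_k - Y_{k-1}) = Y_0 + \sum_{k=1}^{n} 1_{\{k \le \tau\}} (Y_k - Y_{k-1}).$$

The result then follows by Theorem 8.1.9.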
8.2 Martingale Inequalities


Kolmogorov’s Inequality 


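Theorem 8.2.1 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Then, for all $\lambda \in \mathbb{R}_+$,

$$\lambda\, P\Big(\max_{0\le i\le n} S_i > \lambda\Big) \le E\Big[S_n 1_{\{\max_{0\le i\le n} S_i > \lambda\}}\Big]. \tag{8.4}$$

Proof. Define the random time

$$\tau = \inf\{n \ge 0;\ S_n > \lambda\}.$$

It is an $\mathcal{F}_n$-stopping time since

$$A_i := \{\tau = i\} = \Big\{S_i > \lambda,\ \max_{0 \le j \le i-1} S_j \le \lambda\Big\} \in \mathcal{F}_i.$$

The $A_i$’s so defined are mutually disjoint and

$$A := \Big\{\max_{0\le i\le n} S_i > \lambda\Big\} = \bigcup_{i=0}^{n} A_i.$$

Since $\lambda 1_{A_i} \le S_i 1_{A_i}$,

$$\lambda P(A) \le \sum_{i=0}^{n} E[S_i 1_{A_i}].$$

For all $0 \le i \le n$, $A_i$ being $\mathcal{F}_i$-measurable, we have by the submartingale property
that $E[S_n \mid \mathcal{F}_i] \ge S_i$ and therefore $\int_{A_i} S_i \, dP \le \int_{A_i} E[S_n \mid \mathcal{F}_i] \, dP$. Taking these
observations into account,

$$\lambda P(A) \le \sum_{i=0}^{n} E[S_i 1_{A_i}] \le \sum_{i=0}^{n} E\big[E[S_n \mid \mathcal{F}_i]\, 1_{A_i}\big] = \sum_{i=0}^{n} E\big[E[S_n 1_{A_i} \mid \mathcal{F}_i]\big] = \sum_{i=0}^{n} E[S_n 1_{A_i}] = E\Big[S_n \sum_{i=0}^{n} 1_{A_i}\Big] = E[S_n 1_A].$$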
Corollary 8.2.2 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. Then, for all $p \ge 1$, all
$\lambda \in \mathbb{R}_+$,

$$\lambda^p\, P\Big(\max_{0\le i\le n} |M_i| > \lambda\Big) \le E[|M_n|^p]. \tag{8.5}$$

Proof. Let $S_n = |M_n|^p$. This defines an $\mathcal{F}_n$-submartingale (Example 8.1.8) to
which one may apply Kolmogorov’s inequality with $\lambda$ replaced by $\lambda^p$:

$$\lambda^p\, P\Big(\max_{0\le i\le n} |M_i|^p > \lambda^p\Big) \le E\Big[|M_n|^p 1_{\{\max_{0\le i\le n} |M_i|^p > \lambda^p\}}\Big] \le E[|M_n|^p].$$


Doob’s Inequality 


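Recall the notation $\| X \|_p := (E[|X|^p])^{1/p}$.

Theorem 8.2.3 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. For all $p > 1$,

$$\| M_n \|_p \le \Big\| \max_{0\le i\le n} |M_i| \Big\|_p \le q\, \| M_n \|_p, \tag{8.6}$$

where $q$ (the “conjugate” of $p$) is defined by $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. The first inequality is trivial. For the second inequality, observe that for
all non-negative random variables $X$, by Fubini’s theorem,

$$E[X^p] = E\Big[\int_0^X p\, x^{p-1}\, dx\Big] = E\Big[\int_0^\infty p\, x^{p-1} 1_{\{x < X\}}\, dx\Big] = p \int_0^\infty x^{p-1} P(X > x)\, dx.$$

Therefore, applying this and Kolmogorov’s inequality (8.4) to the submartingale
$\{|M_n|\}_{n\ge 0}$,

$$\begin{aligned}
E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^p\Big] &= p \int_0^\infty x^{p-1} P\Big(\max_{0\le i\le n} |M_i| > x\Big)\, dx \\
&\le p \int_0^\infty x^{p-2} E\Big[|M_n| 1_{\{\max_{0\le i\le n} |M_i| > x\}}\Big]\, dx \\
&= p\, E\Big[\int_0^\infty x^{p-2} |M_n| 1_{\{\max_{0\le i\le n} |M_i| > x\}}\, dx\Big] \\
&= p\, E\Big[|M_n| \int_0^{\max_{0\le i\le n} |M_i|} x^{p-2}\, dx\Big] \\
&= \frac{p}{p-1}\, E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big] \\
&= q\, E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big].
\end{aligned}$$

By Hölder’s inequality, and observing that $(p-1)q = p$,

$$E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big] \le E[|M_n|^p]^{1/p}\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{(p-1)q}\Big]^{1/q} = \| M_n \|_p\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{p}\Big]^{1/q}.$$

Therefore

$$E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^p\Big] \le q\, \| M_n \|_p\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{p}\Big]^{1/q},$$

or (eliminating the trivial case where $E[\max_{0\le i\le n} |M_i|^p] = \infty$)

$$E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^p\Big]^{1 - \frac{1}{q}} \le q\, \| M_n \|_p,$$

that is, since $1 - \frac{1}{q} = \frac{1}{p}$,

$$\Big\| \max_{0\le i\le n} |M_i| \Big\|_p \le q\, \| M_n \|_p.$$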
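
As a sanity check of Doob's inequality (8.6) in the case $p = q = 2$, the following sketch computes both sides exactly for the symmetric random walk, a martingale by Example 8.1.2. The choice of the walk and the function name `doob_l2_sides` are assumptions made for this illustration only. The inequality, squared, reads $E[\max_i |M_i|^2] \le 4\,E[M_n^2]$.

```python
from fractions import Fraction
from itertools import product

def doob_l2_sides(n):
    """Return (E[max_i |M_i|^2], 4 * E[M_n^2]) for the symmetric +/-1 walk of n steps."""
    weight = Fraction(1, 2 ** n)            # each of the 2**n paths is equally likely
    lhs = rhs = Fraction(0)
    for steps in product((1, -1), repeat=n):
        m, running_max = 0, 0
        for s in steps:
            m += s
            running_max = max(running_max, abs(m))
        lhs += weight * running_max ** 2
        rhs += weight * 4 * m ** 2          # q**2 = 4 when p = 2
    return lhs, rhs

lhs, rhs = doob_l2_sides(10)
print(lhs, rhs)
```

For this walk $E[M_n^2] = n$, so the right-hand side is exactly $4n$.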
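Hoeffding’s Inequality


Theorem 8.2.4 Let $\{M_n\}_{n\ge 0}$ be a real $\mathcal{F}_n$-martingale such that, for some sequence
$c_1, c_2, \ldots$ of real numbers,

$$|M_n - M_{n-1}| \le c_n, \quad P\text{-a.s.} \quad (n \ge 1). \tag{8.7}$$

Then, for all $x \ge 0$ and all $n \ge 1$,

$$P(|M_n - M_0| \ge x) \le 2 \exp\Big(-\frac{x^2}{2 \sum_{i=1}^{n} c_i^2}\Big). \tag{8.8}$$

Proof. By convexity of $z \mapsto e^{az}$, for $|z| \le 1$ and all $a \in \mathbb{R}$,

$$e^{az} \le \frac{1}{2}(1 - z)e^{-a} + \frac{1}{2}(1 + z)e^{a}.$$

In particular, if $Z$ is a centered random variable such that $P(|Z| \le 1) = 1$,

$$E[e^{aZ}] \le \frac{1}{2} e^{-a} + \frac{1}{2} e^{a} \le e^{a^2/2}.$$

By similar arguments, for all $a \in \mathbb{R}$,

$$E\Big[e^{a \frac{M_n - M_{n-1}}{c_n}} \,\Big|\, \mathcal{F}_{n-1}\Big] \le e^{a^2/2},$$

and, with $a$ replaced by $c_n a$,

$$E\big[e^{a(M_n - M_{n-1})} \mid \mathcal{F}_{n-1}\big] \le e^{a^2 c_n^2/2}.$$

Therefore,

$$\begin{aligned}
E\big[e^{a(M_n - M_0)}\big] &= E\big[e^{a(M_{n-1} - M_0)}\, e^{a(M_n - M_{n-1})}\big] \\
&= E\big[e^{a(M_{n-1} - M_0)}\, E[e^{a(M_n - M_{n-1})} \mid \mathcal{F}_{n-1}]\big] \\
&\le E\big[e^{a(M_{n-1} - M_0)}\big]\, e^{a^2 c_n^2/2},
\end{aligned}$$

and then by recurrence

$$E\big[e^{a(M_n - M_0)}\big] \le e^{\frac{a^2}{2} \sum_{i=1}^{n} c_i^2}.$$

In particular, with $a > 0$, by Markov’s inequality,

$$P(M_n - M_0 \ge x) \le e^{-ax} E\big[e^{a(M_n - M_0)}\big] \le e^{-ax + \frac{a^2}{2} \sum_{i=1}^{n} c_i^2}.$$

Minimization of the right-hand side with respect to $a$ gives

$$P(M_n - M_0 \ge x) \le e^{-x^2 / (2 \sum_{i=1}^{n} c_i^2)}.$$

The same argument with $M_0 - M_n$ instead of $M_n - M_0$ yields the bound

$$P(-(M_n - M_0) \ge x) \le e^{-x^2 / (2 \sum_{i=1}^{n} c_i^2)}.$$

The announced bound then follows from these two bounds since for any random
variable $X$, and all $x \in \mathbb{R}_+$, $P(|X| \ge x) \le P(X \ge x) + P(X \le -x)$.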
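
To get a feeling for the sharpness of (8.8), the bound can be compared with an exact tail. The sketch below assumes, for illustration only, that $M_n$ is the symmetric $\pm 1$ random walk started at 0, so that $c_i = 1$ and $M_n = 2B - n$ with $B$ binomial$(n, 1/2)$. The function names are hypothetical.

```python
import math

def hoeffding_bound(x, cs):
    """Right-hand side of (8.8) for increment bounds cs = [c_1, ..., c_n]."""
    return 2.0 * math.exp(-x * x / (2.0 * sum(c * c for c in cs)))

def exact_tail(n, x):
    """P(|M_n| >= x) for the symmetric +/-1 walk, using M_n = 2B - n with B ~ Bin(n, 1/2)."""
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= x)
    return favourable / 2 ** n

n = 20
for x in (0, 4, 8, 12):
    print(x, exact_tail(n, x), hoeffding_bound(x, [1.0] * n))
```

The exact tail never exceeds the bound, as the theorem requires, but the bound is far from tight for small `x`.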


EXAMPLE 8.2.5: THE KNAPSACK. There are $n$ objects, the $i$-th has a volume
$V_i$ and is worth $W_i$. All these non-negative random variables form an independent
family, the $V_i$’s have finite means and the means of the $W_i$’s are bounded by
$M < \infty$. You have to choose integers $z_1, \ldots, z_n$ in such a way that the total
volume $\sum_{i=1}^{n} z_i V_i$ does not exceed a given storage capacity $c$ and that the total
worth $\sum_{i=1}^{n} z_i W_i$ is maximized. Call this maximal worth $Z$. We shall see that

$$P(|Z - E[Z]| \ge x) \le 2 \exp\Big\{-\frac{x^2}{2 n M^2}\Big\} \quad (x \ge 0).$$

For this consider the variables $Z_j$ which are the equivalent of $Z$ when the $j$-th object
has been removed. Let now $M_j := E[Z \mid \mathcal{F}_j]$, where $\mathcal{F}_j := \sigma((V_k, W_k);\ 1 \le k \le j)$.
Note that in view of the independence assumptions $E[Z_j \mid \mathcal{F}_j] = E[Z_j \mid \mathcal{F}_{j-1}]$.
Clearly $Z_j \le Z \le Z_j + M$. Taking conditional expectations given $\mathcal{F}_j$ and then
$\mathcal{F}_{j-1}$ in this last chain of inequalities reveals that $|M_j - M_{j-1}| \le M$. The rest is
then just Hoeffding’s inequality.


We now give a general framework of application. 


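Let $\mathcal{X}$ be a finite set, and let $f : \mathcal{X}^N \to \mathbb{R}$ be a given function. We introduce
the notation $x = (x_1, \ldots, x_N)$ and $x_1^k = (x_1, \ldots, x_k)$. In particular, $x = x_1^N$. For
$x \in \mathcal{X}^N$, $z \in \mathcal{X}$ and $1 \le k \le N$, let

$$f_k(x, z) := f(x_1, \ldots, x_{k-1}, z, x_{k+1}, \ldots, x_N).$$

The function $f$ is said to satisfy the Lipschitz condition with bound $c$ if for all
$x \in \mathcal{X}^N$, all $z \in \mathcal{X}$ and all $1 \le k \le N$,

$$|f_k(x, z) - f(x)| \le c.$$

Let $X_1, X_2, \ldots, X_N$ be independent random variables with values in $\mathcal{X}$. Define the
martingale

$$M_n = E[f(X) \mid X_1^n].$$

By the independence assumption, with obvious notations,

$$E[f(X) \mid X_1^n] = \sum_{x_{n+1}^N} f(X_1^{n-1}, X_n, x_{n+1}^N)\, P(X_{n+1}^N = x_{n+1}^N)$$

and

$$E[f(X) \mid X_1^{n-1}] = \sum_{x_n} \sum_{x_{n+1}^N} f(X_1^{n-1}, x_n, x_{n+1}^N)\, P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N).$$

Therefore

$$|M_n - M_{n-1}| \le \sum_{x_n} \sum_{x_{n+1}^N} \big|f(X_1^{n-1}, x_n, x_{n+1}^N) - f(X_1^{n-1}, X_n, x_{n+1}^N)\big|\, P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N) \le c.$$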

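EXAMPLE 8.2.6: PATTERN MATCHING. Take $f(x)$ to be the number of occurrences
of the fixed pattern $b = (b_1, \ldots, b_k)$ $(k \le N)$ in the sequence $x =
(x_1, \ldots, x_N)$, that is

$$f(x) = \sum_{i=1}^{N-k+1} 1_{\{x_i = b_1,\ x_{i+1} = b_2,\ \ldots,\ x_{i+k-1} = b_k\}}.$$

The mean number of matches in an IID sequence $X = (X_1, \ldots, X_N)$ with uniform
distribution on $\mathcal{X}$ is therefore

$$\sum_{i=1}^{N-k+1} E\big[1_{\{X_i = b_1,\ \ldots,\ X_{i+k-1} = b_k\}}\big] = \sum_{i=1}^{N-k+1} \Big(\frac{1}{|\mathcal{X}|}\Big)^k,$$

that is,

$$E[f(X)] = (N - k + 1) \Big(\frac{1}{|\mathcal{X}|}\Big)^k.$$

The martingale $M_n := E[f(X) \mid X_1^n]$ is such that $M_0 = E[f(X)]$. Since changing the
value of one coordinate of $x \in \mathcal{X}^N$ changes $f(x)$ by at most $k$, we can apply the
bound (8.8) of Theorem 8.2.4 with $c_i = k$ to obtain the inequality

$$P(|f(X) - E[f(X)]| \ge \lambda) \le 2 e^{-\frac{\lambda^2}{2 N k^2}}.$$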
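
The formula for the mean number of matches, and the fact that changing one coordinate moves $f$ by at most $k$, can both be verified by brute force on a small case. The alphabet `"ab"`, the length `N = 6` and the pattern `aba` below are arbitrary choices for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def count_matches(x, b):
    """Number of positions i at which the pattern b occurs in the sequence x."""
    k = len(b)
    return sum(1 for i in range(len(x) - k + 1) if tuple(x[i:i + k]) == tuple(b))

def exact_mean(alphabet, N, b):
    """E[f(X)] for X uniform on alphabet**N, by enumeration of all sequences."""
    seqs = list(product(alphabet, repeat=N))
    return Fraction(sum(count_matches(x, b) for x in seqs), len(seqs))

def max_single_change(alphabet, N, b):
    """Largest change of f when exactly one coordinate of x is modified."""
    worst = 0
    for x in product(alphabet, repeat=N):
        for pos in range(N):
            for z in alphabet:
                y = x[:pos] + (z,) + x[pos + 1:]
                worst = max(worst, abs(count_matches(y, b) - count_matches(x, b)))
    return worst

pattern = ("a", "b", "a")
print(exact_mean("ab", 6, pattern), max_single_change("ab", 6, pattern))
```

Here $(N - k + 1)\,|\mathcal{X}|^{-k} = 4/8$, which is what the enumeration should return.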
8.3 The Optional Sampling Theorem


Martingale theory rests on two pillars. The first pillar is Doob’s optional
sampling theorem. The second pillar is the martingale convergence theorem (and
its avatars).


The version of the optional sampling theorem given next is the most elementary
one, sufficient for the elementary examples to be considered now. More general
results are given later in this section.


Theorem 8.3.1 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale, and let $\tau$ be an $\mathcal{F}_n$-stopping
time (see Definition 8.1.10). Suppose that at least one of the following conditions
holds:

($\alpha$) $P(\tau \le n_0) = 1$ for some $n_0 \ge 0$, or
($\beta$) $P(\tau < \infty) = 1$ and $|M_n| \le K < \infty$ when $n \le \tau$.

Then

$$E[M_\tau] = E[M_0]. \tag{8.9}$$

Proof. ($\alpha$) Just apply Theorem 8.1.12 (Formula (8.3) with $n = n_0$).
($\beta$) Apply the result of ($\alpha$) to the $\mathcal{F}_n$-stopping time $\tau \wedge n_0$ to obtain

$$E[M_{\tau \wedge n_0}] = E[M_0].$$

But, by dominated convergence,

$$\lim_{n_0 \uparrow \infty} E[M_{\tau \wedge n_0}] = E\Big[\lim_{n_0 \uparrow \infty} M_{\tau \wedge n_0}\Big] = E[M_\tau].$$

EXAMPLE 8.3.2: THE RUIN PROBLEM VIA MARTINGALES. The symmetric
random walk $\{X_n\}_{n\ge 0}$ on $\mathbb{Z}$ with initial state 0 is an $\mathcal{F}_n^X$-martingale (Example
8.1.2). Let $\tau$ be the first time $n$ for which $X_n = -a$ or $+b$, where $a, b > 0$.
This is an $\mathcal{F}_n^X$-stopping time and moreover $\tau < \infty$. Part ($\beta$) of the above result
can be applied with $K = \sup(a, b)$ to obtain $0 = E[X_0] = E[X_\tau]$. Writing
$v = P(-a \text{ is hit before } b)$, we have

$$E[X_\tau] = -av + b(1 - v),$$

and therefore

$$v = \frac{b}{a + b}.$$

EXAMPLE 8.3.3: A COUNTEREXAMPLE. Consider the symmetric random walk
of the previous example, but now define $\tau$ to be the hitting time of $b > 0$, an
almost surely finite time since the symmetric walk on $\mathbb{Z}$ is recurrent. If the optional
sampling theorem applied, one would have

$$0 = E[X_0] = E[X_\tau] = b,$$

an obvious contradiction. Of course, neither condition ($\alpha$) nor ($\beta$) is satisfied.
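
Both the ruin probability of Example 8.3.2 and the counterexample above can be explored with an exact computation of the law of the stopped walk $X_{n \wedge \tau}$. The sketch below is an illustration under assumptions: the barriers $a = 2$, $b = 3$, the horizon of 200 steps and the function names are arbitrary choices.

```python
from fractions import Fraction

def stopped_walk_law(lower, upper, steps):
    """Exact law of X_{n ^ tau} after `steps` steps; lower=None means no lower barrier."""
    law = {0: Fraction(1)}
    for _ in range(steps):
        nxt = {}
        for x, p in law.items():
            stopped = (x == upper) or (lower is not None and x == lower)
            targets = (x,) if stopped else (x - 1, x + 1)
            for y in targets:
                nxt[y] = nxt.get(y, Fraction(0)) + p / len(targets)
        law = nxt
    return law

def mean(law):
    return sum(x * p for x, p in law.items())

two_sided = stopped_walk_law(-2, 3, 200)    # tau = first hit of -a or b, with a = 2, b = 3
one_sided = stopped_walk_law(None, 3, 200)  # tau = hitting time of b = 3 only

print(float(two_sided[-2]), mean(two_sided), mean(one_sided))
```

In both cases the mean of the stopped walk is exactly 0 at every finite horizon, in agreement with Theorem 8.1.12. With two barriers the mass at $-a$ approaches $b/(a+b) = 3/5$. With the single barrier the stopped walk is unbounded below, which is why the passage to the limit fails even though the walk eventually reaches $b$.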


We are now ready for the statement and proof of Doob’s optional sampling 
theorem generalizing the elementary results given at the beginning of the present 
section. 


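Theorem 8.3.4 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale), and let
$\tau_1, \tau_2$ be finite $\mathcal{F}_n$-stopping times such that $P(\tau_1 \le \tau_2) = 1$. If for $i = 1, 2$,

$$E[|Y_{\tau_i}|] < \infty, \tag{8.10}$$

and

$$\liminf_{n \uparrow \infty} E\big[|Y_n| 1_{\{\tau_i > n\}}\big] = 0, \tag{8.11}$$

then, $P$-a.s.

$$E[Y_{\tau_2} \mid \mathcal{F}_{\tau_1}] \ge Y_{\tau_1} \quad (\text{resp., } = Y_{\tau_1}). \tag{8.12}$$

In particular,

$$E[Y_{\tau_2}] \ge E[Y_{\tau_1}] \quad (\text{resp., } = E[Y_{\tau_1}]). \tag{8.13}$$

More generally, if $\{\tau_n\}_{n\ge 1}$ is a non-decreasing sequence of finite $\mathcal{F}_n$-stopping times
satisfying conditions (8.10) and (8.11), the sequence $\{Y_{\tau_n}\}_{n\ge 1}$ is an $\mathcal{F}_{\tau_n}$-submartingale
(resp., martingale).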


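Proof. It suffices to give the proof for the submartingale case. The meaning of
(8.12) is that, for all $A \in \mathcal{F}_{\tau_1}$,

$$E[1_A Y_{\tau_2}] \ge E[1_A Y_{\tau_1}].$$

It is sufficient to show that for all $n \ge 0$,

$$E\big[1_{A \cap \{\tau_1 = n\}} Y_{\tau_2}\big] \ge E\big[1_{A \cap \{\tau_1 = n\}} Y_{\tau_1}\big],$$

or, equivalently since $\tau_1 = n$ implies $\tau_2 \ge n$,

$$E\big[1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_{\tau_2}\big] \ge E\big[1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_{\tau_1}\big] = E\big[1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_n\big].$$

Write this as

$$E\big[1_{B \cap \{\tau_2 \ge n\}} Y_{\tau_2}\big] \ge E\big[1_{B \cap \{\tau_2 \ge n\}} Y_n\big], \tag{$\star$}$$

where $B := A \cap \{\tau_1 = n\}$. By definition of $\mathcal{F}_{\tau_1}$, $B \in \mathcal{F}_n$. It is therefore sufficient
to show that for all $n \ge 0$, all $B \in \mathcal{F}_n$, $(\star)$ holds. We have

$$\begin{aligned}
E\big[1_{B \cap \{\tau_2 \ge n\}} Y_n\big] &= E\big[1_{B \cap \{\tau_2 = n\}} Y_n\big] + E\big[1_{B \cap \{\tau_2 \ge n+1\}} Y_n\big] \\
&\le E\big[1_{B \cap \{\tau_2 = n\}} Y_n\big] + E\big[1_{B \cap \{\tau_2 \ge n+1\}} E[Y_{n+1} \mid \mathcal{F}_n]\big] \\
&= E\big[1_{B \cap \{\tau_2 = n\}} Y_{\tau_2}\big] + E\big[1_{B \cap \{\tau_2 \ge n+1\}} Y_{n+1}\big] \\
&\le E\big[1_{B \cap \{n \le \tau_2 \le n+1\}} Y_{\tau_2}\big] + E\big[1_{B \cap \{\tau_2 \ge n+2\}} Y_{n+2}\big] \\
&\ \ \vdots \\
&\le E\big[1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2}\big] + E\big[1_{B \cap \{\tau_2 > m\}} Y_m\big],
\end{aligned}$$

that is,

$$E\big[1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2}\big] \ge E\big[1_{B \cap \{\tau_2 \ge n\}} Y_n\big] - E\big[1_{B \cap \{\tau_2 > m\}} Y_m\big]$$

for all $m \ge n$. Therefore, by dominated convergence and hypothesis (8.11),

$$\begin{aligned}
E\big[1_{B \cap \{\tau_2 \ge n\}} Y_{\tau_2}\big] &= \lim_{m \uparrow \infty} E\big[1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2}\big] \\
&\ge E\big[1_{B \cap \{\tau_2 \ge n\}} Y_n\big] - \liminf_{m \uparrow \infty} E\big[1_{B \cap \{\tau_2 > m\}} |Y_m|\big] \\
&= E\big[1_{B \cap \{\tau_2 \ge n\}} Y_n\big].
\end{aligned}$$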


Corollary 8.3.5 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale). Let
$\tau_1, \tau_2$ be $\mathcal{F}_n$-stopping times such that $\tau_1 \le \tau_2 \le N$ a.s., for some constant $N < \infty$.
Then (8.13) holds.


Proof. This is an immediate consequence of Theorem 8.3.4. 


Corollary 8.3.6 Let $\{Y_n\}_{n\ge 0}$ be a uniformly integrable $\mathcal{F}_n$-submartingale (resp.,
martingale). Let $\tau_1, \tau_2$ be finite $\mathcal{F}_n$-stopping times. Then (8.12) holds.

Proof. In order to apply Theorem 8.3.4, we have to show that conditions (8.10)
and (8.11) are satisfied when $\{Y_n\}_{n\ge 0}$ is uniformly integrable. Condition (8.11)
follows from part (b) of Theorem 6.5.3 since the $\tau_i$’s are finite and therefore
$P(\tau_i > n) \to 0$ as $n \uparrow \infty$. It remains to show that condition (8.10) is satisfied.
Let $N < \infty$ be an integer. By Corollary 8.3.5, if $\tau$ is a stopping time (here
$\tau_1$ or $\tau_2$),

$$E[Y_0] \le E[Y_{\tau \wedge N}]$$

and therefore

$$E[|Y_{\tau \wedge N}|] = 2E[Y_{\tau \wedge N}^+] - E[Y_{\tau \wedge N}] \le 2E[Y_{\tau \wedge N}^+] - E[Y_0].$$

The submartingale $\{Y_n^+\}_{n\ge 0}$ satisfies

$$E[Y_{\tau \wedge N}^+] = \sum_{j=0}^{N} E\big[1_{\{\tau = j\}} Y_j^+\big] + E\big[1_{\{\tau > N\}} Y_N^+\big] \le \sum_{j=0}^{N} E\big[1_{\{\tau = j\}} Y_N^+\big] + E\big[1_{\{\tau > N\}} Y_N^+\big] = E[Y_N^+] \le E[|Y_N|].$$

Therefore

$$E[|Y_{\tau \wedge N}|] \le 2E[|Y_N|] + E[|Y_0|] \le 3 \sup_n E[|Y_n|].$$

Since by Fatou’s lemma $E[|Y_\tau|] \le \liminf_{N \uparrow \infty} E[|Y_{\tau \wedge N}|]$, we have

$$E[|Y_\tau|] \le 3 \sup_n E[|Y_n|],$$

a finite quantity since $\{Y_n\}_{n\ge 0}$ is uniformly integrable.


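Corollary 8.3.7 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and let
$\tau$ be an $\mathcal{F}_n$-stopping time such that

$$E[\tau] < \infty.$$

Suppose moreover that there exists a constant $c < \infty$ such that, for all $n \ge 0$,

$$E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n] \le c, \quad P\text{-a.s. on } \{\tau > n\}.$$

Then $E[|Y_\tau|] < \infty$ and

$$E[Y_\tau] \ge (\text{resp., } =)\ E[Y_0].$$

Proof. In order to apply Theorem 8.3.4 with $\tau_1 = 0$, $\tau_2 = \tau$, one just has to check
conditions (8.10) and (8.11) for $\tau$. Let $Z_0 := |Y_0|$. With $Z_n := |Y_n - Y_{n-1}|$ $(n \ge 1)$,

$$E\Big[\sum_{j=0}^{\tau} Z_j\Big] = \sum_{n=0}^{\infty} E\Big[1_{\{\tau = n\}} \sum_{j=0}^{n} Z_j\Big] = \sum_{n=0}^{\infty} \sum_{j=0}^{n} E\big[1_{\{\tau = n\}} Z_j\big] = \sum_{j=0}^{\infty} \sum_{n \ge j} E\big[1_{\{\tau = n\}} Z_j\big] = \sum_{j=0}^{\infty} E\big[1_{\{\tau \ge j\}} Z_j\big].$$

For $j \ge 1$, $\{\tau \ge j\} = \overline{\{\tau \le j-1\}} \in \mathcal{F}_{j-1}$ and therefore,

$$E\big[1_{\{\tau \ge j\}} Z_j\big] = E\big[1_{\{\tau \ge j\}} E[Z_j \mid \mathcal{F}_{j-1}]\big] \le c\, P(\tau \ge j), \tag{$\star$}$$

and

$$E\Big[\sum_{j=0}^{\tau} Z_j\Big] \le E[|Y_0|] + c \sum_{j=1}^{\infty} P(\tau \ge j) = E[|Y_0|] + c\, E[\tau] < \infty.$$

Therefore condition (8.10) is satisfied since $E[|Y_\tau|] \le E\big[\sum_{j=0}^{\tau} Z_j\big]$. Moreover, if
$\tau > n$,

$$|Y_n| \le \sum_{j=0}^{n} Z_j \le \sum_{j=0}^{\tau} Z_j,$$

and therefore

$$E\big[1_{\{\tau > n\}} |Y_n|\big] \le E\Big[1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j\Big].$$

But, by $(\star)$, $E\big[\sum_{j=0}^{\tau} Z_j\big] < \infty$. Also, $\{\tau > n\} \downarrow \emptyset$ as $n \uparrow \infty$. Therefore, by
dominated convergence

$$\liminf_{n \uparrow \infty} E\big[1_{\{\tau > n\}} |Y_n|\big] \le \lim_{n \uparrow \infty} E\Big[1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j\Big] = 0.$$

This is condition (8.11).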
Wald’s Formulas


Theorem 8.3.8 Let $\{Z_n\}_{n\ge 1}$ be an IID sequence of real random variables such
that $E[|Z_1|] < \infty$, and let $\tau$ be an $\mathcal{F}_n^Z$-stopping time with $E[\tau] < \infty$. Then

$$E\Big[\sum_{n=1}^{\tau} Z_n\Big] = E[Z_1]\, E[\tau]. \tag{8.14}$$

If, moreover, $E[Z_1^2] < \infty$,

$$\mathrm{Var}\Big(\sum_{n=1}^{\tau} Z_n\Big) = \mathrm{Var}(Z_1)\, E[\tau]. \tag{8.15}$$

Proof. Let $X_0 := 0$, $X_n := (Z_1 + \cdots + Z_n) - nE[Z_1]$ $(n \ge 1)$. Then $\{X_n\}_{n\ge 0}$ is
an $\mathcal{F}_n^Z$-martingale such that

$$E[|X_{n+1} - X_n| \mid \mathcal{F}_n^Z] = E[|Z_{n+1} - E[Z_1]| \mid \mathcal{F}_n^Z] = E[|Z_1 - E[Z_1]|] \le 2E[|Z_1|] < \infty.$$

Therefore Corollary 8.3.7 can be applied with $Y_n = \sum_{i=1}^{n} (Z_i - E[Z_1])$ to obtain
(8.14). For the proof of (8.15), the same kind of argument works, this time with
the martingale $Y_n = X_n^2 - n\, \mathrm{Var}(Z_1)$.
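
Wald's identity (8.14) can be verified exactly in a case where the stopping time is bounded. The sketch below assumes, purely for illustration, that the $Z_n$ are uniform on a finite set of positive integers and that $\tau$ is the first time the partial sum reaches a given level. The function name `wald_sides` and the numerical choices are hypothetical.

```python
from fractions import Fraction

def wald_sides(values, level):
    """Return (E[S_tau], E[Z_1] * E[tau]) for tau = inf{n >= 1 : S_n >= level}.

    `values` must contain positive integers only, so that tau <= level.
    """
    p = Fraction(1, len(values))           # Z_n is uniform on `values`
    alive = {0: Fraction(1)}               # law of S_n restricted to {tau > n}
    e_sum, e_tau, n = Fraction(0), Fraction(0), 0
    while alive:
        n += 1
        nxt = {}
        for s, q in alive.items():
            for z in values:
                if s + z >= level:         # tau = n on these paths
                    e_sum += q * p * (s + z)
                    e_tau += q * p * n
                else:
                    nxt[s + z] = nxt.get(s + z, Fraction(0)) + q * p
        alive = nxt
    return e_sum, sum(values) * p * e_tau

print(wald_sides((1, 2, 3), 10))
```

The two components of the returned pair coincide, as (8.14) predicts, although $S_\tau$ overshoots the level with positive probability.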


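Theorem 8.3.9 Let $\{Z_n\}_{n\ge 1}$ be IID real random variables and let $S_n = Z_1 + \cdots + Z_n$.
Let $\varphi_Z(t) := E[e^{tZ_1}]$ and suppose that $\varphi_Z(t_0)$ exists and is greater than or
equal to 1 for some $t_0 \ne 0$. Let $\tau$ be an $\mathcal{F}_n^Z$-stopping time such that $E[\tau] < \infty$ and
$|S_n| \le c$ on $\{\tau > n\}$ for some constant $c < \infty$. Then

$$E\Big[\frac{e^{t_0 S_\tau}}{\varphi_Z(t_0)^\tau}\Big] = 1. \tag{8.16}$$

Proof. Let $Y_0 := 1$ and for $n \ge 1$,

$$Y_n := \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n}.$$

By application of the result of Example 8.1.3 with $X_i := \frac{e^{t_0 Z_i}}{\varphi_Z(t_0)}$ we have that the
sequence $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n^Z$-martingale. Moreover, on $\{\tau > n\}$,

$$E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n^Z] = Y_n\, E\Big[\Big|\frac{e^{t_0 Z_{n+1}}}{\varphi_Z(t_0)} - 1\Big| \,\Big|\, \mathcal{F}_n^Z\Big] = \frac{Y_n}{\varphi_Z(t_0)}\, E\big[|e^{t_0 Z_1} - \varphi_Z(t_0)|\big] \le K < \infty$$

for some constant $K$, since $\varphi_Z(t_0) \ge 1$ and

$$Y_n = \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n} \le \frac{e^{|t_0| c}}{\varphi_Z(t_0)^n} \le e^{|t_0| c}.$$

Therefore, Corollary 8.3.7 applies to give (8.16).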


8.4 The Martingale Convergence Theorem 


The second pillar of martingale theory is the martingale convergence theorem. 
This result is the probabilistic counterpart of the convergence of a non-negative 
non-increasing, or bounded non-decreasing, sequence of real numbers to a finite 
limit. It says in particular (but we shall give a more complete result soon) that a 
non-negative supermartingale converges almost surely to a finite limit. 


The Upcrossing Inequality 


The proof of the martingale convergence theorem is based on the upcrossing in- 
equality. 


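Theorem 8.4.1 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Let $a, b \in \mathbb{R}$ with $a < b$,
and let $\nu_n$ be the number of upcrossings of $[a, b]$ before ($\le$) time $n$. Then

$$(b - a)\, E[\nu_n] \le E[(S_n - a)^+]. \tag{8.17}$$

(By definition, an upcrossing occurs at time $\ell$ if there exists $k < \ell$ such that $S_k \le a$,
$S_j < b$ for $j = k+1, \ldots, \ell - 1$, and $S_\ell \ge b$.)

Proof. Since $\nu_n$ is the number of upcrossings of $[0, b - a]$ by the submartingale
$\{(S_n - a)^+\}_{n\ge 1}$, we may suppose without loss of generality that $S_n \ge 0$ and take
$a = 0$, and then prove that

$$b\, E[\nu_n] \le E[S_n - S_0], \tag{8.18}$$

where $S_0 = 0$ and $\mathcal{F}_0$ is the gross $\sigma$-field. Define a sequence of $\mathcal{F}_n$-stopping times
as follows:

$$\begin{aligned}
\tau_0 &= 0 \\
\tau_1 &= \inf\{n > \tau_0;\ S_n = 0\} \\
\tau_2 &= \inf\{n > \tau_1;\ S_n \ge b\} \\
&\ \ \vdots \\
\tau_{2k+1} &= \inf\{n > \tau_{2k};\ S_n = 0\} \\
\tau_{2k+2} &= \inf\{n > \tau_{2k+1};\ S_n \ge b\} \\
&\ \ \vdots
\end{aligned}$$

For $i \ge 1$, let

$$\varphi_i = \begin{cases} 1 & \text{if } \tau_m < i \le \tau_{m+1} \text{ for some odd } m, \\ 0 & \text{if } \tau_m < i \le \tau_{m+1} \text{ for some even } m. \end{cases}$$

Observe that

$$\{\varphi_i = 1\} = \bigcup_{\text{odd } m} \big(\{\tau_m < i\} \setminus \{\tau_{m+1} < i\}\big) \in \mathcal{F}_{i-1}$$

and that

$$b\, \nu_n \le \sum_{i=1}^{n} \varphi_i (S_i - S_{i-1}).$$

Therefore

$$\begin{aligned}
b\, E[\nu_n] &\le E\Big[\sum_{i=1}^{n} \varphi_i (S_i - S_{i-1})\Big] = \sum_{i=1}^{n} E[\varphi_i (S_i - S_{i-1})] \\
&= \sum_{i=1}^{n} E\big[\varphi_i\, E[(S_i - S_{i-1}) \mid \mathcal{F}_{i-1}]\big] = \sum_{i=1}^{n} E\big[\varphi_i\, (E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1})\big] \\
&\le \sum_{i=1}^{n} E\big[E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1}\big] = \sum_{i=1}^{n} (E[S_i] - E[S_{i-1}]) = E[S_n - S_0].
\end{aligned}$$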
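
The definition of an upcrossing translates into a short scan of the path, and the upcrossing inequality (8.17) can then be checked exactly on a small example. In the sketch below the submartingale is taken to be the symmetric random walk started at 0 (a martingale, hence a submartingale). This choice, the interval $[0, 2]$ and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def count_upcrossings(path, a, b):
    """Number of passages of the path from a value <= a to a later value >= b."""
    count, below = 0, False
    for s in path:
        if not below and s <= a:
            below = True                 # the path has reached the level a or lower
        elif below and s >= b:
            count += 1                   # an upcrossing of [a, b] is completed
            below = False
    return count

def upcrossing_sides(n, a, b):
    """Return ((b - a) E[nu_n], E[(S_n - a)^+]) for the symmetric walk of n steps."""
    weight = Fraction(1, 2 ** n)
    lhs = rhs = Fraction(0)
    for steps in product((1, -1), repeat=n):
        path, s = [0], 0
        for z in steps:
            s += z
            path.append(s)
        lhs += weight * (b - a) * count_upcrossings(path, a, b)
        rhs += weight * max(path[-1] - a, 0)
    return lhs, rhs

print(count_upcrossings([0, 2, 0, 1, 3, -1, 2], 0, 2))
print(upcrossing_sides(10, 0, 2))
```

The first component of the pair returned by `upcrossing_sides` does not exceed the second, in agreement with (8.17).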


We are now in a position to state and prove the fundamental martingale con- 
vergence theorem. 


Theorem 8.4.2 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Suppose moreover that it
is $L^1$-bounded, that is,

$$\sup_{n \ge 0} E[|S_n|] < \infty. \tag{8.19}$$

Then $\{S_n\}_{n\ge 0}$ converges $P$-a.s. to an integrable random variable $S_\infty$.

Condition (8.19) can be replaced by the equivalent condition

$$\sup_{n \ge 0} E[S_n^+] < \infty.$$

Indeed, if $\{S_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale,

$$E[S_n^+] \le E[|S_n|] \le 2E[S_n^+] - E[S_n] \le 2E[S_n^+] - E[S_0].$$

By changing signs, the same hypothesis leads to the same conclusion for a
supermartingale $\{S_n\}_{n\ge 0}$. Similarly to the previous remark, condition (8.19) can
be replaced by the equivalent condition

$$\sup_{n \ge 0} E[S_n^-] < \infty.$$

Proof. The proof is based on the following observation concerning any deterministic
sequence $\{x_n\}_{n\ge 1}$. If this sequence does not converge, then it is possible to
find two rational numbers $a$ and $b$ such that

$$\liminf_n x_n < a < b < \limsup_n x_n,$$

which implies that the number of upcrossings of $[a, b]$ by this sequence is infinite.
Therefore to prove convergence, it suffices to prove that any interval $[a, b]$ with
rational extremities is crossed at most a finite number of times.

Let $\nu_n([a, b])$ be the number of upcrossings of an interval $[a, b]$ prior ($\le$) to time
$n$ and let $\nu_\infty([a, b]) = \lim_{n \uparrow \infty} \nu_n([a, b])$. By (8.17),

$$(b - a)\, E[\nu_n([a, b])] \le E[(S_n - a)^+] \le E[S_n^+] + |a| \le \sup_{k \ge 0} E[S_k^+] + |a| \le \sup_{k \ge 0} E[|S_k|] + |a| < \infty.$$

Therefore, letting $n \uparrow \infty$,

$$(b - a)\, E[\nu_\infty([a, b])] < \infty.$$

In particular, $\nu_\infty([a, b]) < \infty$, $P$-a.s. Therefore, $P$-a.s. there is only a finite number
of upcrossings of any rational interval $[a, b]$. Equivalently, in view of the observation
made in the first lines of the proof, $\{S_n\}_{n\ge 0}$ converges $P$-a.s. to some random
variable $S_\infty$. Therefore (by Fatou’s lemma for the first inequality):

$$E[|S_\infty|] = E\Big[\lim_{n \uparrow \infty} |S_n|\Big] \le \liminf_{n \uparrow \infty} E[|S_n|] \le \sup_{n \ge 0} E[|S_n|] < \infty.$$
ntoo ntoo n>0 
Corollary 8.4.3


(a) Any non-positive submartingale $\{S_n\}_{n\ge 0}$ almost surely converges to an integrable
random variable.


(b) Any non-negative supermartingale almost surely converges to an integrable
random variable.


Proof. (b) follows from (a) by changing signs. For (a), we have

$$E[|S_n|] = -E[S_n] \le -E[S_0] = E[|S_0|] < \infty.$$

Therefore (8.19) is satisfied and the conclusion then follows from Theorem 8.4.2.


An immediate application of the martingale convergence theorem is to gam- 
bling. The next example teaches us that a gambler in a “fair game” is eventually 
ruined. 


EXAMPLE 8.4.4: FAIR GAME NOT SO FAIR. Consider the situation in Example 8.1.4, assuming that the initial fortune $a$ is a positive integer and that the bets are also positive integers (that is, the functions $b_{n+1}(X_0^n)$ take positive integer values, except if $Y_n = 0$, in which case the gambler is not allowed to bet anymore and the bet is $0$). In particular, $Y_n \ge 0$ for all $n \ge 0$. Therefore the process $\{Y_n\}_{n\ge0}$ is a non-negative $\mathcal{F}_n^X$-martingale and by the martingale convergence theorem it almost surely has a finite limit. Since the bets are assumed positive integers when the fortune of the player is positive, this limit cannot be other than 0. Since $Y_n$ is a non-negative integer for all $n \ge 0$, this can happen only if the fortune of the gambler becomes null in finite time.
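The ruin phenomenon can be observed numerically. The sketch below is only an illustration under assumptions that are not in the text: the gambler always bets one unit while the fortune is positive, the initial fortune is 5, and the horizon is truncated at 5000 rounds (so a few simulated gamblers may still be solvent at the end, even though ruin is almost sure in the untruncated game). The function name `simulate_fair_game` is hypothetical.

```python
import random

def simulate_fair_game(initial_fortune, max_rounds, rng):
    """Return the fortune path Y_0, ..., Y_max_rounds under unit bets."""
    fortune = initial_fortune
    path = [fortune]
    for _ in range(max_rounds):
        bet = 1 if fortune > 0 else 0          # no betting once ruined
        fortune += bet * rng.choice((-1, 1))   # fair +/-1 outcome
        path.append(fortune)
    return path

rng = random.Random(0)  # fixed seed so that the run is reproducible
paths = [simulate_fair_game(5, 5_000, rng) for _ in range(100)]
ruined = sum(1 for path in paths if path[-1] == 0)
print(f"{ruined} of {len(paths)} simulated gamblers are ruined after 5000 rounds")
```

Every simulated path stays non-negative and is absorbed at 0, which is the integer-valued convergence argument of the example in miniature.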


EXAMPLE 8.4.5: BRANCHING PROCESSES VIA MARTINGALES. The power of the concept of martingale will now be illustrated by revisiting the branching process. It is assumed that $P(Z = 0) < 1$ and $P(Z \ge 2) > 0$ (to get rid of trivialities). The stochastic process

$$Y_n = \frac{X_n}{m^n},$$

where $m$ is the average number of sons of a given individual, is an $\mathcal{F}_n^X$-martingale. Indeed, since each one among the $X_n$ members of the $n$th generation gives birth on average to $m$ sons and does this independently of the rest of the population, $E[X_{n+1} | X_n] = mX_n$ and

$$E\left[\frac{X_{n+1}}{m^{n+1}} \,\Big|\, \mathcal{F}_n^X\right] = E\left[\frac{X_{n+1}}{m^{n+1}} \,\Big|\, X_n\right] = \frac{X_n}{m^n}.$$

By the martingale convergence theorem, almost surely

$$\lim_{n\uparrow\infty}\frac{X_n}{m^n} = Y < \infty.$$

In particular, if $m < 1$, then $\lim_{n\uparrow\infty}X_n = 0$ almost surely. Since $X_n$ takes integer values, this implies that the branching process eventually becomes extinct.


If $m = 1$, then $\lim_{n\uparrow\infty}X_n = X_\infty < \infty$ and it is easily argued that this limit must be 0. Therefore, in this case as well the process eventually becomes extinct.


For the case $m > 1$, we consider the unique solution $x$ in $(0,1)$ of $x = g(x)$ ($g$ is the generating function of the typical progeny of a member of the population considered). Suppose we can show that $Z_n = x^{X_n}$ is a martingale. Then, by the martingale convergence theorem, $Z_n$ converges to a finite limit and therefore $X_n$ has a limit $X_\infty$, which however can be infinite. One can easily argue that this limit cannot be other than 0 (extinction) or $\infty$ (non-extinction). Since $\{Z_n\}_{n\ge0}$ is a martingale, $x = E[Z_0] = E[Z_n]$ and therefore, by dominated convergence, $x = E[Z_\infty] = E[x^{X_\infty}] = P(X_\infty = 0)$. Therefore $x$ is the probability of extinction.

It remains to show that $\{Z_n\}_{n\ge0}$ is an $\mathcal{F}_n^X$-martingale. For all $i \in \mathbb{N}$, $E[x^{X_{n+1}} | X_n = i] = x^i$. This is obvious if $i = 0$. If $i > 0$, $X_{n+1}$ is the sum of $i$ independent random variables with the same generating function $g$, and therefore, since $g(x) = x$, $E[x^{X_{n+1}} | X_n = i] = g(x)^i = x^i$. From this last result and the Markov property,

$$E[x^{X_{n+1}} | \mathcal{F}_n^X] = E[x^{X_{n+1}} | X_n] = x^{X_n}.$$

The following results are important refinements of the fundamental martingale 
convergence theorem. 


Theorem 8.4.6 Let $\{M_n\}_{n\ge0}$ be an $\mathcal{F}_n$-martingale such that for some $p \in (1,\infty)$,

$$\sup_{n\ge0}E|M_n|^p < \infty. \qquad (8.20)$$

Then $\{M_n\}_{n\ge0}$ converges a.s. and in $L^p$ to some finite variable $M_\infty$.

Proof. By hypothesis, the martingale $\{M_n\}_{n\ge0}$ is $L^p$-bounded and a fortiori $L^1$-bounded since $p > 1$. Therefore it converges almost surely. By Doob's inequality, $E[\max_{0\le i\le n}|M_i|^p] \le q^pE|M_n|^p$ and in particular,

$$E[\max_{0\le i\le n}|M_i|^p] \le q^p\sup_{k}E|M_k|^p < \infty.$$

Letting $n\uparrow\infty$, we have in view of condition (8.20) that

$$E[\sup_{n\ge0}|M_n|^p] < \infty. \qquad (8.21)$$

Therefore $\{|M_n|^p\}_{n\ge0}$ is uniformly integrable (Theorem 6.5.5). In particular, since it converges almost surely, it also converges in $L^1$ (Theorem 6.5.7). In other words, $\{M_n\}_{n\ge0}$ converges in $L^p$.


The above result was proved for $p > 1$ (the proof depended on Doob's inequality, which is true for $p > 1$). For $p = 1$, a similar result holds with an additional assumption of uniform integrability. Note however that the next result also applies to submartingales.


Theorem 8.4.7 A uniformly integrable $\mathcal{F}_n$-submartingale $\{S_n\}_{n\ge0}$ converges a.s. and in $L^1$ to an integrable random variable $S_\infty$, and $E[S_\infty | \mathcal{F}_n] \ge S_n$.

Proof. By the uniform integrability hypothesis, $\sup_nE[|S_n|] < \infty$ and therefore, by Theorem 8.4.2, $S_n$ converges almost surely to some integrable random variable $S_\infty$. It also converges to this variable in $L^1$ since a uniformly integrable sequence that converges almost surely also converges in $L^1$ (Theorem 6.5.7).

By the submartingale property, for all $A \in \mathcal{F}_n$, all $m \ge n$,

$$E[1_AS_n] \le E[1_AS_m].$$

Since convergence is in $L^1$,

$$\lim_{m\uparrow\infty}E[1_AS_m] = E[1_AS_\infty],$$

so that finally $E[1_AS_n] \le E[1_AS_\infty]$. This being true for all $A \in \mathcal{F}_n$, we have that $E[S_\infty | \mathcal{F}_n] \ge S_n$.


The following result is Lévy’s continuity theorem for conditional expectations. 


Corollary 8.4.8 Let $\{\mathcal{F}_n\}_{n\ge1}$ be a filtration and let $\xi$ be an integrable random variable. Let $\mathcal{F}_\infty := \sigma(\cup_{n\ge1}\mathcal{F}_n)$. Then

$$\lim_{n\uparrow\infty}E[\xi | \mathcal{F}_n] = E[\xi | \mathcal{F}_\infty]. \qquad (8.22)$$

Proof. It suffices to treat the case where $\xi$ is non-negative. The sequence $\{M_n = E[\xi | \mathcal{F}_n]\}_{n\ge1}$ is a uniformly integrable $\mathcal{F}_n$-martingale (Theorem 6.5.4) and by Theorem 8.4.7, it converges almost surely and in $L^1$ to some integrable random variable $M_\infty$. We have to show that $M_\infty = E[\xi | \mathcal{F}_\infty]$. For $m \ge n$ and $A \in \mathcal{F}_n$,

$$E[1_AM_m] = E[1_AM_n] = E[1_AE[\xi | \mathcal{F}_n]] = E[1_A\xi].$$

Since convergence is also in $L^1$, $\lim_{m\uparrow\infty}E[1_AM_m] = E[1_AM_\infty]$. Therefore

$$E[1_AM_\infty] = E[1_A\xi] \qquad (8.23)$$

for all $A \in \mathcal{F}_n$, and therefore for all $A \in \cup_n\mathcal{F}_n$. The $\sigma$-finite measures $A \mapsto E[1_AM_\infty]$ and $A \mapsto E[1_A\xi]$ agreeing on the algebra $\cup_n\mathcal{F}_n$ also agree on the smallest $\sigma$-algebra containing it, that is $\mathcal{F}_\infty$. Therefore (8.23) holds for all $A \in \mathcal{F}_\infty$ (Theorem 4.1.32) and this implies

$$E[1_AM_\infty] = E[1_AE[\xi | \mathcal{F}_\infty]],$$

and finally, since $M_\infty$ is $\mathcal{F}_\infty$-measurable, $M_\infty = E[\xi | \mathcal{F}_\infty]$.
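A concrete instance may help to visualize this continuity property. The sketch below rests on assumptions that are not in the text: the probability space is $[0,1)$ with the Lebesgue measure, $\mathcal{F}_n$ is generated by the dyadic intervals of length $2^{-n}$ (so that $\mathcal{F}_\infty$ is the Borel $\sigma$-field), and $\xi(\omega) = \omega^2$. In that setting $E[\xi | \mathcal{F}_n]$ is the average of $\xi$ over the dyadic interval containing $\omega$, and it should approach $E[\xi | \mathcal{F}_\infty] = \xi$. The function name is hypothetical.

```python
def dyadic_conditional_expectation(omega, n):
    """E[xi | F_n](omega) for xi(omega) = omega**2 on [0, 1) with Lebesgue measure.

    F_n is generated by the dyadic intervals [k 2^-n, (k+1) 2^-n); on each of
    them the conditional expectation is the average of xi over the interval.
    """
    width = 2.0 ** -n
    a = int(omega / width) * width       # left endpoint of the interval containing omega
    b = a + width
    return (a * a + a * b + b * b) / 3   # (1/(b-a)) * integral of x^2 over [a, b)

omega = 0.3
errors = [abs(dyadic_conditional_expectation(omega, n) - omega**2)
          for n in (2, 5, 10, 20)]
print(errors)
```

The printed errors shrink as $n$ grows, in line with (8.22) at this particular $\omega$.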


Backwards (or Reverse) Martingales 


In the following, pay attention to the indexation: the index set is the set of non-positive relative integers. Let $\{\mathcal{F}_n\}_{n\le0}$ be a non-decreasing family of $\sigma$-fields, that is, $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for all $n \le -1$.

There is nothing new in the definition of “backwards” or “reverse” martingales or submartingales, except that the index set is now $\{\ldots,-2,-1,0\}$. For instance, $\{Y_n\}_{n\le0}$ is an $\mathcal{F}_n$-submartingale if $E[Y_n | \mathcal{F}_{n-1}] \ge Y_{n-1}$ for all $n \le 0$. The term “backwards” in fact refers to one of the uses that is made of this notion, that of discussing the limit of $Y_n$ as $n\downarrow-\infty$.

Reverse martingales or submartingales often appear in the following setting. Let $\{Z_k\}_{k\ge0}$ be a sequence of integrable random variables. Suppose that

$$E[Z_k | Z_{k+1}, Z_{k+2}, \ldots] = Z_{k+1} \quad (k \ge 0).$$

Clearly, the change of indexation $k \to -n$ gives a “backwards” martingale. The next example concerns that situation.


EXAMPLE 8.4.9: EMPIRICAL MEAN OF AN IID SEQUENCE. Let $\{X_n\}_{n\ge1}$ be an IID sequence of integrable random variables and let

$$Z_k := \frac{S_k}{k},$$

where $S_k := X_1 + \cdots + X_k$. We shall prove that

$$E[Z_{k-1} | \mathcal{G}_k] = Z_k,$$

where $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \ldots)$. It suffices to prove that for all $k \ge 1$,

$$E[Z_1 | \mathcal{G}_k] = Z_k, \qquad (\star)$$

since it then follows that for $m \le k$,

$$E[Z_m | \mathcal{G}_k] = E[E[Z_1 | \mathcal{G}_m] | \mathcal{G}_k] = E[Z_1 | \mathcal{G}_k] = Z_k.$$

By linearity,

$$S_k = E[S_k | \mathcal{G}_k] = \sum_{j=1}^kE[X_j | \mathcal{G}_k].$$

From the fact that $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \ldots) = \sigma(S_k, X_{k+1}, X_{k+2}, \ldots)$ and by the IID assumption for $\{X_n\}_{n\ge1}$,

$$S_k = \sum_{j=1}^kE[X_j | S_k, X_{k+1}, X_{k+2}, \ldots] = \sum_{j=1}^kE[X_j | S_k].$$

But the pairs $(X_j, S_k)$ $(1 \le j \le k)$ have the same distribution, and therefore

$$\sum_{j=1}^kE[X_j | S_k] = kE[X_1 | S_k] = kE[X_1 | \mathcal{G}_k] = kE[Z_1 | \mathcal{G}_k],$$

from which $(\star)$ follows.
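The key identity $E[X_1 | S_k] = S_k/k$ behind $(\star)$ can be verified by brute force in a small case. The sketch below assumes, purely for illustration, Bernoulli variables with a hypothetical parameter $p = 1/3$ and $k = 4$; it enumerates all outcomes with exact rational arithmetic. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def cond_mean_first_given_sum(k, p):
    """Return {s: E[X_1 | S_k = s]} for IID Bernoulli(p) variables X_1..X_k."""
    num = {}  # E[X_1 ; S_k = s]
    den = {}  # P(S_k = s)
    for xs in product((0, 1), repeat=k):
        s = sum(xs)
        prob = p**s * (1 - p) ** (k - s)
        den[s] = den.get(s, 0) + prob
        num[s] = num.get(s, 0) + xs[0] * prob
    return {s: num[s] / den[s] for s in den}

result = cond_mean_first_given_sum(4, Fraction(1, 3))
print(result)
```

Each value `result[s]` equals `Fraction(s, 4)`, independently of $p$, as the symmetry argument of the example predicts.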


Theorem 8.4.10 Let $\{\mathcal{F}_n\}_{n\le0}$ be a non-decreasing family of $\sigma$-fields. Let $\{S_n\}_{n\le0}$ be an $\mathcal{F}_n$-submartingale. Then:

A. $S_n$ converges $P$-a.s. and in $L^1$ as $n\downarrow-\infty$ to an integrable random variable $S_{-\infty}$, and

B. with $\mathcal{F}_{-\infty} := \cap_{n\le0}\mathcal{F}_n$,

$$S_{-\infty} \le E[S_0 | \mathcal{F}_{-\infty}],$$

with equality if $\{S_n\}_{n\le0}$ is an $\mathcal{F}_n$-martingale.

Proof. First note that by the submartingale property, $S_n \le E[S_0 | \mathcal{F}_n]$ $(n \le 0)$. In particular, $\{S_n\}_{n\le0}$ is not only $L^1$-bounded, but also uniformly integrable (Theorem 6.5.4).

A. Denoting by $\nu_m = \nu_m([a,b])$ the number of upcrossings of $[a,b]$ by $\{S_n\}_{n\le0}$ in the integer interval $[-m, 0]$ and by $\nu = \nu([a,b])$ the total number of upcrossings of $[a,b]$, the upcrossing inequality yields

$$(b-a)E[\nu_m] \le E[(S_0-a)^+] < \infty,$$

and letting $m\uparrow\infty$, $E[\nu] < \infty$. Almost-sure convergence to an integrable random variable $S_{-\infty}$ is then proved as in Theorem 8.4.2. Since $\{S_n\}_{n\le0}$ is uniformly integrable, convergence to $S_{-\infty}$ is also in $L^1$.

B. Clearly, $S_{-\infty}$ is $\mathcal{F}_{-\infty}$-measurable. Also, by the submartingale property, $S_n \le E[S_0 | \mathcal{F}_n]$ $(n \le -1)$, that is, for all $n \le -1$ and all $A \in \mathcal{F}_n$,

$$\int_AS_n\,dP \le \int_AS_0\,dP.$$

This is true for any $A \in \mathcal{F}_{-\infty}$ because $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_n$ for all $n \le -1$. Since $S_n$ converges to $S_{-\infty}$ in $L^1$ as $n\downarrow-\infty$, $\int_AS_n\,dP \to \int_AS_{-\infty}\,dP$ and therefore

$$\int_AS_{-\infty}\,dP \le \int_AS_0\,dP \quad (A \in \mathcal{F}_{-\infty}),$$

which implies that $S_{-\infty} \le E[S_0 | \mathcal{F}_{-\infty}]$.

The martingale case is obtained using the same proof with each $\le$ symbol replaced by $=$.

Statement B says that $\{S_n\}_{n\in-\mathbb{N}\cup\{-\infty\}}$ is a submartingale relatively to the history $\{\mathcal{F}_n\}_{n\in-\mathbb{N}\cup\{-\infty\}}$.


EXAMPLE 8.4.11: THE STRONG LAW OF LARGE NUMBERS. The situation is that of Example 8.4.9. By Theorem 8.4.10, $S_k/k$ converges almost surely. By Kolmogorov's zero-one law (Theorem 6.3.3), $S_k/k \to a$, a deterministic number. It remains to identify $a$ with $E[X_1]$. We know from the first lines of the proof of Theorem 8.4.10 that $\{S_k/k\}_{k\ge1}$ is uniformly integrable. Therefore, by Theorem 6.5.7,

$$\lim_{k\uparrow\infty}E\left[\frac{S_k}{k}\right] = a.$$

But for all $k \ge 1$, $E[S_k/k] = E[X_1]$.

The uniform integrability of the backwards submartingale in Theorem 8.4.10 followed directly from the submartingale property. This is not the case for a supermartingale unless one adds a condition.
Theorem 8.4.12 Let $\{\mathcal{F}_n\}_{n\le0}$ be a filtration and let $\{S_n\}_{n\le0}$ be an $\mathcal{F}_n$-supermartingale such that

$$\sup_{n\le0}E[S_n] < \infty. \qquad (8.24)$$

Then

A. $S_n$ converges $P$-a.s. and in $L^1$ as $n\downarrow-\infty$ to an integrable random variable $S_{-\infty}$, and

B. with $\mathcal{F}_{-\infty} := \cap_{n\le0}\mathcal{F}_n$,

$$S_{-\infty} \ge E[S_0 | \mathcal{F}_{-\infty}], \quad P\text{-a.s.}$$

Proof. It suffices to prove uniform integrability, since the rest of the proof then follows the same lines as in Theorem 8.4.7.

Fix $\varepsilon > 0$ and select $k \le 0$ such that

$$\lim_{i\downarrow-\infty}E[S_i] - E[S_k] \le \varepsilon \qquad (\star)$$

(the limit exists and is finite, since by the supermartingale property $E[S_n]$ is non-decreasing as $n\downarrow-\infty$, and it is bounded by (8.24)). Then $0 \le E[S_n] - E[S_k] \le \varepsilon$ for all $n \le k$. We first show that for sufficiently large $\lambda > 0$,

$$\int_{\{|S_n|>\lambda\}}|S_n|\,dP \le \varepsilon.$$

It is enough to prove this for sufficiently large $-n$, here for $-n \ge -k$. The previous integral is equal to

$$-\int_{\{S_n<-\lambda\}}S_n\,dP + E[S_n] - \int_{\{S_n\le\lambda\}}S_n\,dP.$$

By the supermartingale hypothesis, this quantity is

$$\le -\int_{\{S_n<-\lambda\}}S_k\,dP + E[S_n] - \int_{\{S_n\le\lambda\}}S_k\,dP.$$

In view of $(\star)$, this is less than or equal to

$$-\int_{\{S_n<-\lambda\}}S_k\,dP + E[S_k] - \int_{\{S_n\le\lambda\}}S_k\,dP + \varepsilon,$$

which is bounded above by

$$\int_{\{|S_n|>\lambda\}}|S_k|\,dP + \varepsilon.$$

Since $\varepsilon$ is an arbitrary positive quantity, it remains to show that $\int_{\{|S_n|>\lambda\}}|S_k|\,dP$ tends to 0 uniformly in $n \le 0$ as $\lambda\uparrow\infty$. But since $\{S_n^-\}_{n\le0}$ is a submartingale,

$$E[|S_n|] = E[S_n] + 2E[S_n^-] \le \sup_{n\le0}E[S_n] + 2E[S_0^-].$$

Therefore, in view of hypothesis (8.24),

$$P(|S_n| > \lambda) \le \frac{E[|S_n|]}{\lambda} \to 0$$

uniformly in $n \le 0$, and therefore

$$\int_{\{|S_n|>\lambda\}}|S_k|\,dP \to 0$$

uniformly in $n$.
The following result is the backwards Lévy’s continuity theorem for conditional 
expectations. 


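Corollary 8.4.13 Let $\{\mathcal{F}_n\}_{n\le0}$ be a history and let $\xi$ be an integrable random variable. Then, with $\mathcal{F}_{-\infty} := \cap_{n\le0}\mathcal{F}_n$,

$$\lim_{n\downarrow-\infty}E[\xi | \mathcal{F}_n] = E[\xi | \mathcal{F}_{-\infty}]. \qquad (8.25)$$

Proof. $M_n := E[\xi | \mathcal{F}_n]$ $(n \le 0)$ is an $\mathcal{F}_n$-martingale and therefore by the backwards martingale convergence theorem, it converges as $n\downarrow-\infty$ almost surely and in $L^1$ to some integrable variable $M_{-\infty}$, and

$$M_{-\infty} = E[M_0 | \mathcal{F}_{-\infty}] = E[E[\xi | \mathcal{F}_0] | \mathcal{F}_{-\infty}] = E[\xi | \mathcal{F}_{-\infty}]$$

since $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_0$.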


The Robbins—Sigmund Theorem 


In applications, one often encounters random sequences that are not quite martin- 
gales, submartingales or supermartingales, but “nearly” so, up to “perturbations” . 
The statement of the result below will make this precise. 


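Theorem 8.4.14 Let $\{V_n\}_{n\ge1}$, $\{\beta_n\}_{n\ge1}$, $\{\gamma_n\}_{n\ge1}$ and $\{\delta_n\}_{n\ge1}$ be real non-negative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge1}$ and such that

$$E[V_{n+1} | \mathcal{F}_n] \le V_n(1+\beta_n) + \gamma_n - \delta_n \quad (n \ge 1). \qquad (8.26)$$

Then, on the set

$$\Gamma = \Big\{\sum_{n\ge1}\beta_n < \infty\Big\} \cap \Big\{\sum_{n\ge1}\gamma_n < \infty\Big\}, \qquad (8.27)$$

the sequence $\{V_n\}_{n\ge1}$ converges almost surely to a finite random variable and moreover $\sum_{n\ge1}\delta_n < \infty$ $P$-almost surely.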


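Proof. 1. Let $\alpha_0 := 1$ and

$$\alpha_n := \prod_{k=1}^n(1+\beta_k)^{-1} \quad (n \ge 1),$$

and let

$$V_n' := \alpha_{n-1}V_n, \quad \gamma_n' := \alpha_n\gamma_n, \quad \delta_n' := \alpha_n\delta_n \quad (n \ge 1).$$

Then

$$E[V_{n+1}' | \mathcal{F}_n] = \alpha_nE[V_{n+1} | \mathcal{F}_n] \le \alpha_nV_n(1+\beta_n) + \alpha_n\gamma_n - \alpha_n\delta_n,$$

that is, since $\alpha_nV_n(1+\beta_n) = \alpha_{n-1}V_n$,

$$E[V_{n+1}' | \mathcal{F}_n] \le V_n' + \gamma_n' - \delta_n'.$$

Therefore, the random sequence $\{Y_n\}_{n\ge1}$ defined by

$$Y_n := V_n' - \sum_{k=1}^{n-1}(\gamma_k' - \delta_k')$$

is an $\mathcal{F}_n$-supermartingale.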


2. For a > 0, let 


The sequence $\{Y_{n\wedge T_a}\}_{n\ge1}$ is an $\mathcal{F}_n$-supermartingale bounded from below by $-a$. It therefore converges to a finite limit. Therefore, on $\{T_a = \infty\}$, $\{Y_n\}_{n\ge1}$ converges to a finite limit.

3. On $\Gamma$, $\prod_{k=1}^n(1+\beta_k)$ converges almost surely to a positive limit and therefore $\lim_{n\uparrow\infty}\alpha_n > 0$. Therefore, condition $\sum_{n\ge1}\gamma_n < \infty$ implies $\sum_{n\ge1}\gamma_n' < \infty$.

4. By definition of $Y_n$,

$$Y_n + \sum_{k=1}^{n-1}\gamma_k' = V_n' + \sum_{k=1}^{n-1}\delta_k' \ge \sum_{k=1}^{n-1}\delta_k'.$$

But on $\Gamma \cap \{T_a = \infty\}$, $\{Y_n\}_{n\ge1}$ converges to a finite random variable, and therefore $\sum_{k\ge1}\delta_k' < \infty$.

5. Since on $\Gamma \cap \{T_a = \infty\}$, $\sum_{n\ge1}\gamma_n' < \infty$, $\sum_{n\ge1}\delta_n' < \infty$ and $\{Y_n\}_{n\ge1}$ converges to a finite random variable, it follows that $\{V_n'\}_{n\ge1}$ converges to a finite limit. Since $\lim_{n\uparrow\infty}\alpha_n > 0$, it follows in turn that $\{V_n\}_{n\ge1}$ converges to a finite limit and $\sum_{n\ge1}\delta_n < \infty$ on $\Gamma \cap \{T_a = \infty\}$, and therefore on $\Gamma \cap (\cup_a\{T_a = \infty\}) = \Gamma$.


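Corollary 8.4.15 Let $\{V_n\}_{n\ge1}$, $\{\gamma_n\}_{n\ge1}$ and $\{\delta_n\}_{n\ge1}$ be real non-negative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge1}$. Suppose that for all $n \ge 1$

$$E[V_{n+1} | \mathcal{F}_n] \le V_n + \gamma_n - \delta_n. \qquad (8.28)$$

Let $\{a_n\}_{n\ge1}$ be a random sequence that is strictly positive and strictly increasing and let

$$\Gamma := \Big\{\sum_{n\ge1}\frac{\gamma_n}{a_n} < \infty\Big\}. \qquad (8.29)$$

Then, almost surely:

1. on $\Gamma$, the series $\sum_{n\ge1}\frac{V_{n+1}-V_n}{a_n}$ is convergent and $\sum_{n\ge1}\frac{\delta_n}{a_n} < \infty$,

2. on $\Gamma \cap \{\lim_{n\uparrow\infty}a_n < \infty\}$, $\{V_n\}_{n\ge1}$ converges almost surely, and

3. on $\Gamma \cap \{\lim_{n\uparrow\infty}a_n = \infty\}$, $\lim_{n\uparrow\infty}\frac{V_n}{a_n} = 0$ and $\lim_{n\uparrow\infty}\frac{V_{n+1}}{a_n} = 0$.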


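Proof. 1. Let for $n \ge 1$

$$Z_n := \frac{V_1}{a_1} + \sum_{k=1}^{n-1}\frac{V_{k+1}-V_k}{a_k} = \sum_{k=2}^nV_k\left(\frac{1}{a_{k-1}} - \frac{1}{a_k}\right) + \frac{V_n}{a_n}.$$

Since $\frac{1}{a_{k-1}} - \frac{1}{a_k} > 0$, we have that $Z_n \ge 0$ $(n \ge 1)$. Also

$$E[Z_{n+1} | \mathcal{F}_n] \le Z_n + \frac{\gamma_n}{a_n} - \frac{\delta_n}{a_n}.$$

Therefore, by Theorem 8.4.14, on $\Gamma$, $\{Z_n\}_{n\ge1}$ converges and $\sum_{n\ge1}\frac{\delta_n}{a_n} < \infty$. Note that in particular

$$\lim_{n\uparrow\infty}\frac{V_{n+1}-V_n}{a_n} = 0 \text{ on } \Gamma. \qquad (8.30)$$

2. If moreover $\lim_{n\uparrow\infty}a_n = a_\infty < \infty$, the convergence of $\sum_{n\ge1}\frac{V_{n+1}-V_n}{a_n}$ implies that of $\sum_{n\ge1}(V_{n+1}-V_n)$, and therefore $\{V_n\}_{n\ge1}$ converges.

3. If on the contrary $\lim_{n\uparrow\infty}a_n = \infty$, the convergence of $\sum_{n\ge1}\frac{V_{n+1}-V_n}{a_n}$ implies that of $\frac{V_{n+1}}{a_n}$ (and therefore that of $\frac{V_n}{a_n}$ by (8.30)) to 0 (recall Kronecker's lemma: if $a_n > 0$ and $a_n\uparrow\infty$, the convergence of $\sum_{n\ge1}\frac{x_n}{a_n}$ implies that $\lim_{n\uparrow\infty}\frac{1}{a_n}\sum_{k=1}^nx_k = 0$).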


8.5 Square-integrable Martingales 


Let $\{\mathcal{F}_n\}_{n\ge0}$ be a filtration. Recall that a process $\{H_n\}_{n\ge0}$ is called $\mathcal{F}_n$-predictable if for all $n \ge 1$, $H_n$ is $\mathcal{F}_{n-1}$-measurable.


Doob’s decomposition 


Theorem 8.5.1 Let $\{S_n\}_{n\ge0}$ be an $\mathcal{F}_n$-submartingale. Then there exists a $P$-a.s. unique non-decreasing $\mathcal{F}_n$-predictable process $\{A_n\}_{n\ge0}$ with $A_0 = 0$ and a unique $\mathcal{F}_n$-martingale $\{M_n\}_{n\ge0}$ such that for all $n \ge 0$,

$$S_n = M_n + A_n.$$

Proof. Existence is proved by explicit construction. Let $M_0 := S_0$, $A_0 = 0$ and, for $n \ge 1$,

$$M_n := S_0 + \sum_{j=0}^{n-1}\{S_{j+1} - E[S_{j+1} | \mathcal{F}_j]\},$$

$$A_n := \sum_{j=0}^{n-1}(E[S_{j+1} | \mathcal{F}_j] - S_j).$$

Clearly, $\{M_n\}_{n\ge0}$ and $\{A_n\}_{n\ge0}$ have the announced properties. In order to prove uniqueness, let $\{M_n'\}_{n\ge0}$ and $\{A_n'\}_{n\ge0}$ be another such decomposition. In particular, for $n \ge 0$,

$$A_{n+1}' - A_n' = (A_{n+1} - A_n) + (M_{n+1} - M_n) - (M_{n+1}' - M_n').$$

Therefore

$$E[A_{n+1}' - A_n' | \mathcal{F}_n] = E[A_{n+1} - A_n | \mathcal{F}_n],$$

and, since $A_{n+1}' - A_n'$ and $A_{n+1} - A_n$ are $\mathcal{F}_n$-measurable,

$$A_{n+1}' - A_n' = A_{n+1} - A_n, \quad P\text{-a.s.} \quad (n \ge 0),$$

from which it follows that $A_n' = A_n$ a.s. for all $n \ge 0$ (recall that $A_0' = A_0$) and then $M_n' = M_n$ a.s. for all $n \ge 0$.
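The explicit construction in the existence part translates directly into a few lines of code. The sketch below is an illustration under assumptions that are not in the text: the submartingale is $S_n = W_n^2$ for a simple symmetric random walk $\{W_n\}$ started at 0, for which $E[S_{j+1} | \mathcal{F}_j] = S_j + 1$; the helper `doob_decomposition` and its `cond_next` argument (a user-supplied conditional expectation) are hypothetical names.

```python
import random

def doob_decomposition(path, cond_next):
    """Split a path S_0..S_N into (M, A) following the explicit construction.

    cond_next(j, history) must return E[S_{j+1} | F_j] given history = S_0..S_j.
    """
    M = [path[0]]
    A = [0]
    for j in range(len(path) - 1):
        predicted = cond_next(j, path[: j + 1])
        M.append(M[-1] + path[j + 1] - predicted)  # martingale increment
        A.append(A[-1] + predicted - path[j])      # predictable increment
    return M, A

rng = random.Random(1)
walk = [0]
for _ in range(20):
    walk.append(walk[-1] + rng.choice((-1, 1)))
squares = [w * w for w in walk]  # S_n = W_n^2

# For the squared walk, E[S_{j+1} | F_j] = S_j + 1.
M, A = doob_decomposition(squares, lambda j, history: history[-1] + 1)
print(A[:5], M[:5])
```

Here the compensator is deterministic, $A_n = n$, and the martingale part is $M_n = W_n^2 - n$.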
Definition 8.5.2 The sequence $\{A_n\}_{n\ge0}$ in Theorem 8.5.1 is called the compensator of $\{S_n\}_{n\ge0}$.

Definition 8.5.3 Let $\{M_n\}_{n\ge0}$ be a square-integrable $\mathcal{F}_n$-martingale (that is, $E[M_n^2] < \infty$ for all $n \ge 0$). The compensator of the $\mathcal{F}_n$-submartingale $\{M_n^2\}_{n\ge0}$ is denoted by $\{\langle M\rangle_n\}_{n\ge0}$ and is called the bracket process of $\{M_n\}_{n\ge0}$.

By the explicit construction in the proof of Theorem 8.5.1, $\langle M\rangle_0 := 0$ and for $n \ge 1$,

$$\langle M\rangle_n = \sum_{j=0}^{n-1}\{E[M_{j+1}^2 | \mathcal{F}_j] - M_j^2\} = \sum_{j=0}^{n-1}E[(M_{j+1}^2 - M_j^2) | \mathcal{F}_j]. \qquad (8.31)$$

Also, for all $0 \le k \le n$,

$$E[(M_n - M_k)^2 | \mathcal{F}_k] = E[M_n^2 - M_k^2 | \mathcal{F}_k] = E[\langle M\rangle_n - \langle M\rangle_k | \mathcal{F}_k].$$

Therefore, $\{M_n^2 - \langle M\rangle_n\}_{n\ge0}$ is an $\mathcal{F}_n$-martingale. In particular, if $M_0 = 0$, $E[M_n^2] = E[\langle M\rangle_n]$.


EXAMPLE 8.5.4: Let $\{Z_n\}_{n\ge1}$ be a sequence of IID centered random variables of finite variance. Let $M_0 := 0$ and $M_n := \sum_{j=1}^nZ_j$ for $n \ge 1$. Then, for $n \ge 1$,

$$\langle M\rangle_n = \sum_{j=1}^nE[Z_j^2] = nE[Z_1^2].$$


Theorem 8.5.5 If $E[\langle M\rangle_\infty] < \infty$, the square-integrable martingale $\{M_n\}_{n\ge0}$ converges almost surely to a finite limit, and convergence takes place also in $L^2$.

Proof. This is Theorem 8.4.6 for the particular case $p = 2$. In fact, condition (8.20) thereof is satisfied since

$$\sup_{n\ge1}E[M_n^2] = \sup_{n\ge1}E[\langle M\rangle_n] = E[\langle M\rangle_\infty] < \infty.$$
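The Martingale Law of Large Numbers

Theorem 8.5.6 Let $\{M_n\}_{n\ge0}$ be a square-integrable $\mathcal{F}_n$-martingale. Then:

A. On $\{\langle M\rangle_\infty < \infty\}$, $M_n$ converges to a finite limit.

B. On $\{\langle M\rangle_\infty = \infty\}$, $M_n/\langle M\rangle_n \to 0$.

Proof. A. Let $K > 0$ be fixed. The random time

$$\tau_K := \inf\{n \ge 0 : \langle M\rangle_{n+1} > K\}$$

is an $\mathcal{F}_n$-stopping time since the bracket process is $\mathcal{F}_n$-predictable. Also $\langle M\rangle_{n\wedge\tau_K} \le K$ and therefore by Theorem 8.5.5, $\{M_{n\wedge\tau_K}\}_{n\ge0}$ converges to a finite limit. Therefore $\{M_n\}_{n\ge0}$ converges to a finite limit on the set $\{\langle M\rangle_\infty \le K\}$ contained in $\{\tau_K = \infty\}$. Hence the result since

$$\{\langle M\rangle_\infty < \infty\} = \bigcup_{K\ge1}\{\tau_K = \infty\}.$$

B. Note that

$$E[M_{n+1}^2 | \mathcal{F}_n] = M_n^2 + \langle M\rangle_{n+1} - \langle M\rangle_n.$$

Define

$$V_n = M_n^2, \quad \gamma_n = \langle M\rangle_{n+1} - \langle M\rangle_n, \quad a_n = \langle M\rangle_{n+1}^2.$$

The result then follows from Part 3 of Corollary 8.4.15 (observe that there exists a $k_0$ such that $a_k \ge 1$ for $k \ge k_0$ and

$$\sum_{k\ge k_0}\frac{\gamma_k}{a_k} = \sum_{k\ge k_0}\frac{\langle M\rangle_{k+1} - \langle M\rangle_k}{\langle M\rangle_{k+1}^2} \le \int_1^\infty\frac{dx}{x^2} < \infty),$$

which says, in particular, that $\sqrt{V_{n+1}/a_n} = |M_{n+1}|/\langle M\rangle_{n+1}$ converges to 0.

We do not have in general $\{\langle M\rangle_\infty < \infty\} = \{\{M_n\}_{n\ge0} \text{ converges}\}$.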


The following is a conditioned version of the Borel—Cantelli lemma. Note that, 
in this form, we have a necessary and sufficient condition. 


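Corollary 8.5.7 Let $\{\mathcal{F}_n\}_{n\ge1}$ be a filtration and let $\{A_n\}_{n\ge1}$ be a sequence of events such that $A_n \in \mathcal{F}_n$ $(n \ge 1)$. Then

$$\Big\{\sum_{n\ge1}P(A_n | \mathcal{F}_{n-1}) = \infty\Big\} = \Big\{\sum_{n\ge1}1_{A_n} = \infty\Big\}.$$

Proof. Define $\{M_n\}_{n\ge0}$ by $M_0 := 0$ and for $n \ge 1$,

$$M_n := \sum_{k=1}^n\big(1_{A_k} - P(A_k | \mathcal{F}_{k-1})\big).$$

This is a square-integrable $\mathcal{F}_n$-martingale, with bracket process

$$\langle M\rangle_n = \sum_{k=1}^nP(A_k | \mathcal{F}_{k-1})(1 - P(A_k | \mathcal{F}_{k-1})).$$

In particular,

$$\langle M\rangle_n \le \sum_{k=1}^nP(A_k | \mathcal{F}_{k-1}).$$

A. Suppose that $\sum_{k\ge1}P(A_k | \mathcal{F}_{k-1}) < \infty$. Then, by the above inequality, $\langle M\rangle_\infty < \infty$, and therefore, by Part A of Theorem 8.5.6, $M_n$ converges. Since by hypothesis, $\sum_{k\ge1}P(A_k | \mathcal{F}_{k-1}) < \infty$, this implies that $\sum_{k\ge1}1_{A_k} < \infty$.

B. Suppose that $\sum_{k\ge1}P(A_k | \mathcal{F}_{k-1}) = \infty$ and $\langle M\rangle_\infty < \infty$. Then $M_n$ converges to a finite random variable and therefore

$$\frac{M_n}{\sum_{k=1}^nP(A_k | \mathcal{F}_{k-1})} = \frac{\sum_{k=1}^n1_{A_k}}{\sum_{k=1}^nP(A_k | \mathcal{F}_{k-1})} - 1 \to 0.$$

C. Suppose that $\sum_{k\ge1}P(A_k | \mathcal{F}_{k-1}) = \infty$ and $\langle M\rangle_\infty = \infty$. Then $\frac{M_n}{\langle M\rangle_n} \to 0$ and a fortiori,

$$\frac{M_n}{\sum_{k=1}^nP(A_k | \mathcal{F}_{k-1})} \to 0,$$

that is,

$$\frac{\sum_{k=1}^n1_{A_k}}{\sum_{k=1}^nP(A_k | \mathcal{F}_{k-1})} \to 1.$$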

8.6 Exercises 


Exercise 8.6.1. CONDITIONAL JENSEN’S INEQUALITY 

Let $I$ be a general interval of $\mathbb{R}$ (closed, open, semi-closed, infinite, etc.) and let $(a,b)$ be its interior, assumed non-empty. Let $\varphi : I \to \mathbb{R}$ be a convex function. Let $X$ be an integrable real-valued random variable such that $P(X \in I) = 1$. Assume moreover that either $\varphi$ is non-negative, or that $\varphi(X)$ is integrable. Prove that for any sub-$\sigma$-field $\mathcal{G} \subseteq \mathcal{F}$

$$E[\varphi(X) | \mathcal{G}] \ge \varphi(E[X | \mathcal{G}]).$$


Exercise 8.6.2. DISCOUNTED PRODUCT 
Let $\{X_n\}_{n\ge1}$ be a sequence of independent integrable random variables with a common mean $m \ne 0$. Show that

$$Y_n = m^{-n}X_1X_2\cdots X_n \quad (n \ge 1)$$

is an $\mathcal{F}_n^X$-martingale.


Exercise 8.6.3. MEAN HITTING TIME VIA MARTINGALES 

Let $\{X_n\}_{n\ge0}$ be a symmetric random walk on $\mathbb{Z}$. Show that $\{X_n\}_{n\ge0}$ and $\{X_n^2 - n\}_{n\ge0}$ are $\mathcal{F}_n^X$-martingales. Deduce from this the mean of the hitting time $T$ of $\{-a,b\}$, where $a$ and $b$ are positive integers.


Exercise 8.6.4. PROBABILITY OF HIT 
Let $\{X_n\}_{n\ge0}$ be an HMC with state space $E$, and let $B$ be a closed subset of states, that is,

$$\sum_{j\in B}p_{ij} = 1 \quad (i \in B).$$

Let $T$ be the hitting time of $B$, and let for $i \in E$,

$$h(i) := P_i(T < \infty).$$

Show that $\{h(X_n)\}_{n\ge0}$ is an $\mathcal{F}_n^X$-martingale.


Exercise 8.6.5. RUINED AGAIN! 


Show that the function $h(i) = \left(\frac{q}{p}\right)^i$ is harmonic for the nonsymmetric random walk on $\mathbb{Z}$ (with $p_{i,i+1} = p$, $p_{i,i-1} = q = 1 - p$), where $p \in (0,1)$, $p \ne \frac{1}{2}$.

Apply the optional sampling theorem to obtain the ruin probability in the ruin problem of Example 9.1.10.


Exercise 8.6.6. THE LEVY MARTINGALE 
Let $\{X_n\}_{n\ge0}$ be an HMC with state space $E$ and transition matrix $P$, and let $f : E \to \mathbb{R}$ be a bounded function. Show that the process

$$M_n^f := f(X_n) - f(X_0) - \sum_{k=0}^{n-1}(P - I)f(X_k)$$

is an $\mathcal{F}_n^X$-martingale.


Exercise 8.6.7. MARTINGALE CHARACTERIZATION OF AN HMC 

Let $\{X_n\}_{n\ge0}$ be a stochastic process with values in the countable space $E$. It is not assumed to be an HMC. Let $P$ be some transition matrix on $E$. Prove that if for all bounded $f : E \to \mathbb{R}$, $\{M_n^f\}_{n\ge0}$ defined in Exercise 8.6.6 is a martingale with respect to $\{\mathcal{F}_n^X\}_{n\ge0}$, then $\{X_n\}_{n\ge0}$ is an HMC with transition matrix $P$.


Exercise 8.6.8. A MARTINGALE REPRESENTATION THEOREM 

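Let $\{X_n\}_{n\ge0}$ be a sequence of $\{0,1\}$-valued random variables and let $\lambda_{n-1} := E[X_n | \mathcal{F}_{n-1}^X]$ $(n \ge 0)$, where $\mathcal{F}_{-1}^X := \{\emptyset, \Omega\}$. Show that any $\mathcal{F}_n^X$-martingale $\{M_n\}_{n\ge0}$ is of the form

$$M_n = M_0 + \sum_{j=0}^nH_j(X_j - \lambda_{j-1}),$$

where $\{H_n\}_{n\ge0}$ is an $\mathcal{F}_n^X$-predictable sequence.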


Exercise 8.6.9. A MARTINGALE nue ON A PERMUTATION 


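Let $a_1,\ldots,a_k \in \mathbb{R}$ be such that $\sum_{i=1}^ka_i = 0$. Let $\pi$ be a completely random permutation of $\{1,\ldots,k\}$, that is, $P(\pi = \pi_0) = \frac{1}{k!}$ for all permutations $\pi_0$ of $\{1,\ldots,k\}$. Let $\mathcal{F}_n := \sigma(\pi(1),\ldots,\pi(n))$ $(1 \le n \le k)$ and

$$X_n := \frac{k}{k-n}\sum_{j=1}^na_{\pi(j)}.$$

Show that $\{X_n\}_{1\le n<k}$ is an $\mathcal{F}_n$-martingale.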


Exercise 8.6.10. F, 
Prove Theorem 8.1.11. 


Exercise 8.6.11. RUINED AGAIN. 

Show that the function $h(i) = \left(\frac{1-p}{p}\right)^i$ is harmonic for the nonsymmetric random walk on $\mathbb{Z}$ (with $p_{i,i+1} = p$, $p_{i,i-1} = 1 - p$, where $p \in (0,1)$ and $p \ne \frac{1}{2}$). Apply the optional sampling theorem to obtain the ruin probability in Example 9.1.10.


Exercise 8.6.12. ABSORPTION PROBABILITY 
Consider the HMC $\{X_n\}_{n\ge0}$ with state space $E = \{0,1,\ldots,m\}$ and transition probabilities

$$p_{ij} = \binom{m}{j}\left(\frac{i}{m}\right)^j\left(1 - \frac{i}{m}\right)^{m-j}.$$

In particular, 0 and $m$ are absorbing states.

(a) Show that $\{X_n\}_{n\ge0}$ is a martingale.


(b) Compute the probability of absorption by state 0. 


Exercise 8.6.13. UPCROSSINGS 

Let $\{M_n\}_{n\ge0}$ be an $\mathcal{F}_n$-martingale. Let $a, b \in \mathbb{R}$ with $a < b$, and let $\nu_n$ be the number of upcrossings of $[a,b]$ before ($\le$) time $n$. For $k \ge 1$, let $A_k$ be the event that there are exactly $k-1$ upcrossings of $[a,b]$ before ($\le$) time $n$. Show that

$$(b-a)P(\nu_n \ge k) \le E[(a - M_n)1_{A_k}].$$


Exercise 8.6.14. $E[X | \mathcal{F}_\tau] = X_\tau$
Let $\tau$ be a stopping time for the filtration $\{\mathcal{F}_n\}_{n\ge1}$. Let $X_n := E[X | \mathcal{F}_n]$ $(n \ge 1)$ where $X$ is an integrable random variable. Prove that $E[X | \mathcal{F}_\tau] = X_\tau$.


Exercise 8.6.15. MARTINGALE BOUNDED BY AN INTEGRABLE RANDOM VARI- 
ABLE 

Let $\{X_n\}_{n\ge1}$ be an $\mathcal{F}_n$-martingale and let $Z$ be an integrable random variable such that $X_n \le Z$ $(n \ge 1)$. Prove that $\{X_n\}_{n\ge1}$ converges almost surely.


Exercise 8.6.16. THE GAMBLER WITH UNLIMITED CREDIT 

Consider the gambling situation of Example 8.1.4 when the stakes are bounded, say by $M$, and when the initial fortune of the gambler is $a$. But we suppose that the gambler can borrow whatever amount he needs, so that his “fortune” $Y_n$ at any time $n$ can take arbitrary values. Prove that

$$P(|Y_n - a| \ge \lambda) \le 2\exp\left(-\frac{\lambda^2}{2nM^2}\right).$$


Exercise 8.6.17. FAIR COIN TOSSES 

Consider a Bernoulli sequence of parameter $\frac{1}{2}$ representing a fair game of HEADS and TAILS. Let $X_n$ be the number of HEADS after $n$ tosses. Use Hoeffding's inequality to prove that

$$P(|X_n - E[X_n]| \ge \lambda) \le 2\exp\left(-\frac{2\lambda^2}{n}\right).$$

Exercise 8.6.18. KRICKEBERG'S DECOMPOSITION

Prove that an $\mathcal{F}_n$-martingale $\{M_n\}_{n\ge0}$ such that $\sup_{n\ge0}E[|M_n|] < \infty$ is the difference of two non-negative $\mathcal{F}_n$-martingales. (Hint: Doob's decomposition applied to $|M_n|$.)


Exercise 8.6.19. POLYA’S URN 

At time 0 an urn contains exactly one black ball and one white ball. At time $n \ge 0$, a ball is drawn at random and then at time $n+1$ this ball is put back into the urn together with another ball of the same color. In particular, there are at time $n$ exactly $n+2$ balls in the urn. Let $B_n$ be the number of black balls in the urn. Let $X_n := \frac{B_n}{n+2}$ be the proportion of black balls at time $n$. Show that $\{X_n\}_{n\ge0}$ is a martingale and that the ratio of the number of black balls to the number of white balls converges.


Exercise 8.6.20. RECORDS 

Let $\{X_n\}_{n\ge1}$ be an IID sequence of random variables with a common cumulative distribution $F$ that is continuous. For $1 \le i \le n$, let $Y_i := 1$ if and only if $X_i = \max(X_1,\ldots,X_i)$. We shall admit that the rank of $X_i$ among $X_1,\ldots,X_i$ is uniformly distributed on $\{1,\ldots,i\}$ and that $\{Y_i\}_{1\le i\le n}$ is an independent family. Let $Z_n := \sum_{i=1}^n1_{\{Y_i=1\}}$ (the number of times a record is broken, that is, the number of $i$'s such that $X_i > \max(X_1,\ldots,X_{i-1})$). Prove that

$$\frac{Z_n}{\ln n} \to 1 \text{ almost surely.}$$


Exercise 8.6.21. A MAXIMAL INEQUALITY 
Let $\{X_n\}_{n\ge0}$ be a centered square-integrable martingale. Let $\lambda > 0$. Prove the following inequality:

$$P\left(\max_{0\le i\le n}X_i \ge \lambda\right) \le \frac{E[X_n^2]}{E[X_n^2] + \lambda^2}.$$

Hint: With $c > 0$, work with the sequence $\{(X_n + c)^2\}_{n\ge0}$ and then select an appropriate $c$.


Exercise 8.6.22. AN EXTENSION OF HOEFFDING’S INEQUALITY 
Let $M$ be a real $\mathcal{F}_n^X$-martingale such that, for some sequence $d_1, d_2, \ldots$ of real numbers,

$$P(B_n \le M_n - M_{n-1} \le B_n + d_n) = 1 \quad (n \ge 1),$$

where for each $n \ge 1$, $B_n$ is a function of $X_0^{n-1}$. Prove that, for all $x \ge 0$,

$$P(|M_n - M_0| \ge x) \le 2\exp\left(-2x^2\Big/\sum_{i=1}^nd_i^2\right).$$
Exercise 8.6.23. THE DERIVATIVE OF A LIPSCHITZ CONTINUOUS FUNCTION
Let $f : [0,1) \to \mathbb{R}$ satisfy a Lipschitz condition, that is,

$$|f(x) - f(y)| \le M|x - y| \quad (x, y \in [0,1)),$$

where $M < \infty$. Let $\Omega = [0,1)$, $\mathcal{F} = \mathcal{B}([0,1))$ and let $P$ be the Lebesgue measure on $[0,1)$. Let for all $n \ge 1$

$$\xi_n(\omega) := \sum_{k=1}^{2^n}(k-1)2^{-n}\,1_{\{[(k-1)2^{-n},\,k2^{-n})\}}(\omega)$$

and

$$\mathcal{F}_n := \sigma(\xi_k;\ 1 \le k \le n).$$

(i) Show that $\mathcal{F}_n = \sigma(\xi_n)$ and $\vee_n\mathcal{F}_n = \mathcal{B}([0,1))$.

(ii) Let

$$X_n := \frac{f(\xi_n + 2^{-n}) - f(\xi_n)}{2^{-n}}.$$

Show that $\{X_n\}_{n\ge1}$ is a uniformly integrable $\mathcal{F}_n$-martingale.

(iii) Show that there exists a measurable function $g : [0,1) \to \mathbb{R}$ such that $X_n \to g$ $P$-almost surely and that $X_n = E[g | \mathcal{F}_n]$.

(iv) Show that for all $n \ge 1$ and all $k$ $(1 \le k \le 2^n)$

$$f(k2^{-n}) - f(0) = \int_0^{k2^{-n}}g(x)\,dx$$

and deduce from this that

$$f(x) - f(0) = \int_0^xg(y)\,dy \quad (x \in [0,1)).$$


Exercise 8.6.24. A NON-UNIFORMLY INTEGRABLE MARTINGALE 
Let $\{X_n\}_{n\ge0}$ be a sequence of IID random variables such that $P(X_n = 0) = P(X_n = 2) = \frac{1}{2}$ $(n \ge 0)$. Define

$$Z_n := \prod_{j=1}^nX_j \quad (n \ge 0).$$

Show that $\{Z_n\}_{n\ge0}$ is an $\mathcal{F}_n^X$-martingale and prove that it is not uniformly integrable.

Exercise 8.6.25. THE BALLOT PROBLEM VIA MARTINGALES

This exercise proposes an alternative proof for the ballot problem. Let $k := a + b$ and let $D_n$ be the difference between the number of votes for A and the number of votes for B at time $n \ge 1$. Prove that


is a martingale. Deduce from this that the probability that A leads throughout the voting process is $(a-b)/(a+b)$. Hint: $\tau := \inf\{n;\ X_n = 0\} \wedge (k-1)$.


Exercise 8.6.26. A VOTING MODEL 

Let $G = (V, \mathcal{E})$ be a finite graph. Each vertex $v$ shelters a random variable $X_n(v)$ representing the opinion (0 or 1) at time $n$ of the voter located at this vertex. At each time $n$, an edge $(v,w)$ is chosen at random, and one of the two vertices, again chosen at random (say $v$), reconsiders his opinion passing from $X_n(v)$ to $X_{n+1}(v) = X_n(w)$. The initial opinions at time 0 are given. Let $Z_n$ be the total number of votes for 1 at time $n$. Show that $\{Z_n\}_{n\ge0}$ is a martingale that converges in finite random time to a random variable $Z_\infty$ taking the values 0 or $|V|$, the probability that all opinions are eventually 1 being equal to the initial proportion of 1's.




Chapter 9 
Markov Chains 


Discrete-time homogeneous Markov chains are sequences $\{X_n\}_{n\ge0}$ of random variables with values in some denumerable set $E$, that can always be represented (in a sense to be made precise) by a recurrence equation $X_{n+1} = f(X_n, Z_{n+1})$, where $\{Z_n\}_{n\ge1}$ is an IID sequence independent of the initial state $X_0$. The probabilistic dependence on the past is only through the previous state, but this limited amount of memory suffices to produce enough varied and complex behavior to make Markov chains the most important source of stochastic models in the applied sciences.


9.1 The Transition Matrix 


A particle moves on a denumerable set $E$. If at time $n$, the particle is in position $X_n = i$, it will be at time $n+1$ in a position $X_{n+1} = j$ chosen independently of the past trajectory $X_{n-1}, X_{n-2}, \ldots$ with probability $p_{ij}$. This can be represented by a labeled directed graph, called the transition graph, whose set of vertices is $E$, and for which there is a directed edge from $i \in E$ to $j \in E$ with label $p_{ij}$ if and only if the latter quantity is positive. Note that there may be “self-loops”, corresponding to positions $i$ such that $p_{ii} > 0$.


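[Figure: a transition graph on the states 1, 2, 3, 4, with directed edges labeled by transition probabilities, among them $p_{11}$, $p_{12}$, $p_{32}$ and $p_{41}$.]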
This graphical interpretation of a Markov chain in terms of a “random walk” on a set $E$ is adapted to the study of random walks on graphs. Since the interpretation of a Markov chain in such terms is not always the natural one, we proceed to give a more formal definition.


Definition 9.1.1 If for all integers $n \ge 0$ and all states $i_0, i_1, \ldots, i_{n-1}, i, j$,

$$P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j | X_n = i),$$

the stochastic process $\{X_n\}_{n\ge0}$ is called a Markov chain, and a homogeneous Markov chain (HMC) if, in addition, the right-hand side is independent of $n$.


The matrix $P = \{p_{ij}\}_{i,j\in E}$, where

$$p_{ij} = P(X_{n+1} = j | X_n = i),$$

is called the transition matrix of the HMC. Since the entries are probabilities, and since a transition from any state $i$ must be to some state, it follows that

$$p_{ij} \ge 0, \quad\text{and}\quad \sum_{k\in E}p_{ik} = 1$$

for all states $i, j$. A matrix $P$ indexed by $E$ and satisfying the above properties is called a stochastic matrix. The state space may be infinite, and therefore such a matrix is in general not of the kind studied in linear algebra. However, the basic operations of addition and multiplication will be defined by the same formal rules. The notation $x = \{x(i)\}_{i\in E}$ formally represents a column vector, and $x^T$ is the corresponding row vector.


The Markov property easily extends (Exercise 9.7.2) to

$$P(A | X_n = i, B) = P(A | X_n = i),$$

where

$$A = \{X_{n+1} = j_1, \ldots, X_{n+k} = j_k\}, \quad B = \{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}\}.$$

This is in turn equivalent to

$$P(A \cap B | X_n = i) = P(A | X_n = i)P(B | X_n = i).$$

That is, $A$ and $B$ are conditionally independent given $X_n = i$.

In other words, the future at time $n$ and the past at time $n$ are conditionally independent given the present state $X_n = i$. In particular, the Markov property is independent of the direction of time.
Notation. We shall from now on abbreviate $P(A | X_0 = i)$ as $P_i(A)$. Also, if $\mu$ is a probability distribution on $E$, then $P_\mu(A)$ is the probability of $A$ given that the initial state $X_0$ is distributed according to $\mu$.


The distribution at time $n$ of the chain is the vector $\nu_n := \{\nu_n(i)\}_{i\in E}$, where

$$\nu_n(i) = P(X_n = i).$$

From the Bayes rule of total causes, $\nu_{n+1}(j) = \sum_{i\in E}\nu_n(i)p_{ij}$, that is, in matrix form, $\nu_{n+1}^T = \nu_n^TP$. Iteration of this equality yields

$$\nu_n^T = \nu_0^TP^n. \qquad (9.1)$$

The matrix $P^m$ is called the $m$-step transition matrix because its general term is

$$p_{ij}(m) = P(X_{n+m} = j | X_n = i).$$

In fact, by the Bayes sequential rule and the Markov property, the right-hand side equals $\sum_{i_1,\ldots,i_{m-1}\in E}p_{ii_1}p_{i_1i_2}\cdots p_{i_{m-1}j}$, which is the general term of the $m$-th power of $P$.
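The relation between one-step updates and matrix powers can be checked on a small example. The sketch below assumes a hypothetical two-state transition matrix (not taken from the text) and uses exact rational arithmetic; the helper names `step`, `mat_mul` and `mat_pow` are illustrative only. It propagates the initial distribution three times with $\nu_{n+1}^T = \nu_n^TP$ and compares the result with $\nu_0^TP^3$.

```python
from fractions import Fraction

def step(nu, P):
    """One step of nu_{n+1}(j) = sum_i nu_n(i) p_ij (row vector times matrix)."""
    size = len(P)
    return [sum(nu[i] * P[i][j] for i in range(size)) for j in range(size)]

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(P, m):
    """m-th power of P by repeated multiplication (m >= 0)."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(m):
        result = mat_mul(result, P)
    return result

# Hypothetical two-state chain, chosen only for illustration.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
nu0 = [Fraction(1), Fraction(0)]  # start in state 0

nu = nu0
for _ in range(3):
    nu = step(nu, P)  # nu_3 obtained by three one-step updates

P3 = mat_pow(P, 3)
print(nu, step(nu0, P3))
```

Since the chain starts in state 0, the distribution at time 3 is also the first row of $P^3$, that is, the 3-step transition probabilities out of state 0.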


The probability distribution $\nu_0$ of the initial state $X_0$ is called the initial distribution. From the Bayes sequential rule and in view of the homogeneous Markov property and the definition of the transition matrix,

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_k = i_k) = \nu_0(i_0)p_{i_0i_1}\cdots p_{i_{k-1}i_k}.$$

Therefore,

Theorem 9.1.2 The distribution of a discrete-time HMC is uniquely determined by its initial distribution and its transition matrix.


Many HMCs receive a natural description in terms of a recurrence equation. 


Theorem 9.1.3 Let $\{Z_n\}_{n \ge 1}$ be an IID sequence of random variables with values
in an arbitrary space $F$. Let $E$ be a countable space, and $f : E \times F \to E$ be some
function. Let $X_0$ be a random variable with values in $E$, independent of $\{Z_n\}_{n \ge 1}$.
The recurrence equation

$$X_{n+1} = f(X_n, Z_{n+1}) \qquad (9.2)$$

then defines an HMC.


Proof. Iteration of recurrence (9.2) shows that for all $n \ge 1$, there is a function
$g_n$ such that $X_n = g_n(X_0, Z_1, \ldots, Z_n)$, and therefore

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0)
= P(f(i, Z_{n+1}) = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0)
= P(f(i, Z_{n+1}) = j),$$

since the event $\{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i\}$ is expressible
in terms of $X_0, Z_1, \ldots, Z_n$ and is therefore independent of $Z_{n+1}$. Similarly,
$P(X_{n+1} = j \mid X_n = i) = P(f(i, Z_{n+1}) = j)$. We therefore have a Markov chain,
and it is homogeneous since the right-hand side of the last equality does not depend
on $n$. Explicitly:

$$p_{ij} = P(f(i, Z_1) = j). \qquad (9.3)$$


EXAMPLE 9.1.4: 1-D RANDOM WALK, TAKE 1. Let $X_0$ be a random variable
with values in $\mathbb{Z}$. Let $\{Z_n\}_{n \ge 1}$ be a sequence of IID random variables, independent
of $X_0$, taking the values $+1$ or $-1$, and with the probability distribution
$P(Z_n = +1) = p$, where $p \in (0,1)$. The process $\{X_n\}_{n \ge 0}$ defined by

$$X_{n+1} = X_n + Z_{n+1}$$

is, in view of Theorem 9.1.3, an HMC, called a random walk on $\mathbb{Z}$. It is called a
“symmetric” random walk if $p = \frac{1}{2}$.
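The recurrence of Theorem 9.1.3 translates directly into a simulation loop. The sketch below is only an illustration: the function name `random_walk`, the seed, and the number of steps are arbitrary choices, not part of the text.

```python
import random

def random_walk(x0, p, n_steps, rng):
    """Simulate X_{n+1} = X_n + Z_{n+1}, with P(Z = +1) = p and P(Z = -1) = 1 - p."""
    path = [x0]
    for _ in range(n_steps):
        z = 1 if rng.random() < p else -1   # one IID increment Z_{n+1}
        path.append(path[-1] + z)
    return path

rng = random.Random(0)                # fixed seed so the run is reproducible
path = random_walk(0, 0.5, 20, rng)   # symmetric walk started at 0
print(path)
```

Each call to `rng.random()` plays the role of one fresh innovation, which is what makes the simulated sequence fit the framework of the theorem.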


EXAMPLE 9.1.5: THE REPAIR SHOP, TAKE 1. During day $n$, $Z_{n+1}$ machines
break down, and they enter the repair shop on day $n+1$. Every day one machine
among those waiting for service is repaired. Therefore, denoting by $X_n$ the number
of machines in the shop on day $n$,

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1}, \qquad (9.4)$$

where $a^+ = \max(a, 0)$. In particular, if $\{Z_n\}_{n \ge 1}$ is an IID sequence independent of
the initial state $X_0$, then $\{X_n\}_{n \ge 0}$ is a homogeneous Markov chain. In terms of
the probability distribution $P(Z_1 = k) = a_k$ $(k \ge 0)$, its transition matrix is

$$P = \begin{pmatrix}
a_0 & a_1 & a_2 & a_3 & \cdots \\
a_0 & a_1 & a_2 & a_3 & \cdots \\
0 & a_0 & a_1 & a_2 & \cdots \\
0 & 0 & a_0 & a_1 & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{pmatrix}.$$

Indeed, from (9.3),

$$p_{ij} = P((i-1)^+ + Z_1 = j) = P(Z_1 = j - (i-1)^+) = a_{j - (i-1)^+}.$$
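The formula $p_{ij} = a_{j-(i-1)^+}$ can be used to tabulate a finite corner of this infinite matrix. The following sketch assumes a made-up distribution $(a_0, a_1, a_2)$ and a hypothetical helper `repair_shop_matrix`; rows near the truncation boundary no longer sum to 1 because part of their mass lies outside the displayed block.

```python
from fractions import Fraction

def repair_shop_matrix(a, size):
    """Top-left size x size block of the matrix with entries p_ij = a_{j - (i-1)^+}."""
    def a_k(k):
        # a_k = 0 outside the support of the breakdown distribution
        return a[k] if 0 <= k < len(a) else Fraction(0)
    return [[a_k(j - max(i - 1, 0)) for j in range(size)] for i in range(size)]

# Hypothetical breakdown distribution: a_0 = 1/2, a_1 = 1/3, a_2 = 1/6.
a = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
P = repair_shop_matrix(a, 5)
for row in P:
    print([str(x) for x in row])
```

Rows 0 and 1 coincide, and each later row is the previous one shifted one column to the right, as in the displayed matrix.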
EXAMPLE 9.1.6: STOCHASTIC AUTOMATA. A finite automaton $(E, A, f)$ can
read sequences of letters from a finite alphabet $A$ written on some infinite tape.
It can be in any state of a finite set $E$, and its evolution is governed by a function
$f : E \times A \to E$, as follows. When the automaton is in state $i \in E$ and reads letter
$a \in A$, it switches from state $i$ to state $j = f(i, a)$ and then reads on the tape the
next letter to the right.


Figure 9.1: The automaton: the recognition process and the Markov chain.
An automaton can be represented by its transition graph $G$ having for nodes
the states of $E$. There is an oriented edge from the node (state) $i$ to the node $j$ if
and only if there exists an $a \in A$ such that $j = f(i, a)$, and this edge then receives
label $a$. If $j = f(i, a_1) = f(i, a_2)$ for $a_1 \neq a_2$, then there are two edges from $i$ to
$j$ with labels $a_1$ and $a_2$, or, more economically, one such edge with label $(a_1, a_2)$.
More generally, a given oriented edge can have multiple labels of any order.


Consider, for instance, the automaton with alphabet A = {0,1} corresponding 
to the transition graph of Figure 9.1a. As the automaton, initialized in state 0, 
reads the sequence of Figure 9.1b from left to right, it passes successively through 
the states (including the initial state 0) 


0100123100123123010. 


Rewriting the sequence of states below the sequence of letters, it appears that the 
automaton is in state 3 after it has seen three consecutive 1’s. This automaton is 
therefore able to recognize and count such blocks of 1’s. However, it does not take 
into account overlapping blocks (see Figure 9.1b). 


If the sequence of letters read by the automaton is $\{Z_n\}_{n \ge 1}$, the sequence of
states $\{X_n\}_{n \ge 0}$ is then given by the recurrence equation $X_{n+1} = f(X_n, Z_{n+1})$ and
therefore, if $\{Z_n\}_{n \ge 1}$ is IID and independent of the initial state $X_0$, then $\{X_n\}_{n \ge 0}$
is, according to Theorem 9.1.3, an HMC.
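The recurrence $X_{n+1} = f(X_n, Z_{n+1})$ can be run on a concrete input. Since Figure 9.1 is not reproduced here, the transition function below is an assumption inferred from the state sequence quoted above (a letter 0 resets to state 0, a letter 1 advances the count, and state 3 restarts the count at 1), and the letter sequence is the one that this inferred function needs in order to produce that state sequence.

```python
def f(state, letter):
    """Transition function inferred from the quoted state sequence (an assumption)."""
    if letter == 0:
        return 0                          # a 0 breaks any block of 1's
    return 1 if state == 3 else state + 1  # after a full block, start counting again

def run(letters, start=0):
    """States visited, including the initial state, when reading the letters in order."""
    states = [start]
    for a in letters:
        states.append(f(states[-1], a))
    return states

# Hypothetical tape content, chosen to be consistent with the state sequence in the text.
letters = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0]
states = run(letters)
print("".join(map(str, states)))
print("blocks of three 1's recognized:", states.count(3))
```

Replacing `letters` by IID random draws from $\{0, 1\}$ would turn `states` into a sample path of the corresponding HMC.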


Not all homogeneous Markov chains receive a “natural” description of the type 
featured in Theorem 9.1.3. However, it is always possible to find a “theoretical” 
description of this kind. 


Theorem 9.1.7 For any transition matrix $P$ on $E$, there exists a homogeneous
Markov chain with this transition matrix and with a representation such as in
Theorem 9.1.3.


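Proof. With the states labeled by the nonnegative integers, define

$$X_{n+1} = j \quad \text{if} \quad \sum_{k=0}^{j-1} p_{X_n k} \le Z_{n+1} < \sum_{k=0}^{j} p_{X_n k},$$

where $\{Z_n\}_{n \ge 1}$ is IID, uniform on $[0,1]$. By application of Theorem 9.1.3 and of
formula (9.3), we check that this HMC has the announced transition matrix.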
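The construction in this proof is also a practical way of simulating a chain from its transition matrix. Here is a minimal sketch for a finite state space $\{0, \ldots, |E|-1\}$; the three-state matrix is hypothetical and the names `next_state` and `simulate` are illustrative.

```python
import random
from itertools import accumulate

def next_state(i, z, P):
    """Return the j such that sum_{k<j} p_ik <= z < sum_{k<=j} p_ik."""
    for j, c in enumerate(accumulate(P[i])):
        if z < c:
            return j
    return len(P) - 1   # guard against floating-point rounding in the last cumulative sum

def simulate(P, x0, n_steps, rng):
    """Sample path driven by IID uniform variables on [0, 1)."""
    path = [x0]
    for _ in range(n_steps):
        path.append(next_state(path[-1], rng.random(), P))
    return path

# Hypothetical three-state transition matrix.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
rng = random.Random(1)
path = simulate(P, 0, 10000, rng)
```

Transitions of probability zero correspond to empty intervals, so they are never selected.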


As we already mentioned, not all homogeneous Markov chains are naturally 
described by the model of Theorem 9.1.3. A slight modification of this result 
considerably enlarges its scope. 
Theorem 9.1.8 Let things be as in Theorem 9.1.3 except for the joint distribution
of $X_0, Z_1, Z_2, \ldots$. Suppose instead that for all $n \ge 0$, $Z_{n+1}$ is conditionally
independent of $Z_n, \ldots, Z_1, X_{n-1}, \ldots, X_0$ given $X_n$, and that for all $i \in E$ and
all values $k$ of $Z_{n+1}$, $P(Z_{n+1} = k \mid X_n = i)$ is independent of $n$. Then $\{X_n\}_{n \ge 0}$
is an HMC, with transition probabilities

$$p_{ij} = P(f(i, Z_1) = j \mid X_0 = i).$$


Proof. The proof is quite similar to that of Theorem 9.1.3 and is left as an 
exercise. 


EXAMPLE 9.1.9: THE EHRENFEST URN, TAKE 1. This idealized model of diffusion
through a porous membrane, proposed in 1907 by the Austrian physicists
Tatiana and Paul Ehrenfest to describe in terms of statistical mechanics the ex- 
change of heat between two systems at different temperatures, considerably helped 
our understanding of the phenomenon of thermodynamic irreversibility. It features 
N particles that can be either in compartment A or in compartment B. 


Suppose that at time $n \ge 0$, $X_n = i$ particles are in A. One then chooses a
particle at random, and this particle is moved at time $n+1$ from where it is to
the other compartment. Thus, the next state $X_{n+1}$ is either $i-1$ (the displaced
particle was found in compartment A) with probability $\frac{i}{N}$, or $i+1$ (it was found
in B) with probability $\frac{N-i}{N}$. This model pertains to Theorem 9.1.8. For all $n \ge 0$,

$$X_{n+1} = X_n + Z_{n+1},$$

where $Z_n \in \{-1, +1\}$ and $P(Z_{n+1} = -1 \mid X_n = i) = \frac{i}{N}$. The nonzero entries of
the transition matrix are therefore

$$p_{i,i+1} = \frac{N-i}{N}, \qquad p_{i,i-1} = \frac{i}{N}.$$


First-step Analysis 


Some functionals of homogeneous Markov chains such as probabilities of absorption 
by a closed set and average times before absorption can be evaluated by a technique 
called first-step analysis. 


EXAMPLE 9.1.10: THE GAMBLER'S RUIN, TAKE 1. Two players A and B play
“heads or tails”, where heads occur with probability $p \in (0,1)$, and the successive
outcomes form an IID sequence. Calling $X_n$ the fortune in dollars of player A at
time $n$, then $X_{n+1} = X_n + Z_{n+1}$, where $Z_{n+1} = +1$ (resp., $-1$) with probability
$p$ (resp., $q := 1 - p$), and $\{Z_n\}_{n \ge 1}$ is IID. In other words, A bets \$1 on heads at
each toss, and B bets \$1 on tails. The respective initial fortunes of A and B are
$a$ and $b$ (positive integers). The game ends when a player is ruined, and therefore
the process $\{X_n\}_{n \ge 0}$ is a random walk as described in Example 9.1.4, except that
it is restricted to $E = \{0, \ldots, a, a+1, \ldots, a+b = c\}$. The duration of the game is
$T$, the first time $n$ at which $X_n = 0$ or $c$, and the probability of winning for A is
$u(a) = P(X_T = c \mid X_0 = a)$.


Figure: The gambler's ruin (a sample path in which A wins, the fortune reaching
$c = a + b$ at time $T = 11$).
Instead of computing $u(a)$ alone, first-step analysis computes

$$u(i) = P(X_T = c \mid X_0 = i)$$

for all states $i$, $0 \le i \le c$, and for this, it first generates a recurrence equation for
$u(i)$ by breaking down event “A wins” according to what can happen after the first
step (the first toss) and using the rule of total causes. If $X_0 = i$, $1 \le i \le c-1$, then
$X_1 = i+1$ (resp., $X_1 = i-1$) with probability $p$ (resp., $q$), and the probability
of winning for A with updated initial fortune $i+1$ (resp., $i-1$) is $u(i+1)$ (resp.,
$u(i-1)$). Therefore, for $i$, $1 \le i \le c-1$,

$$u(i) = p\,u(i+1) + q\,u(i-1),$$

with the boundary conditions $u(0) = 0$, $u(c) = 1$.


The characteristic equation associated with this linear recurrence equation is
$pr^2 - r + q = 0$. It has two distinct roots, $r_1 = 1$ and $r_2 = \frac{q}{p}$, if $p \neq \frac{1}{2}$, and a
double root, $r_1 = 1$, if $p = \frac{1}{2}$. Therefore, the general solution is
$u(i) = \lambda r_1^i + \mu r_2^i = \lambda + \mu \left(\frac{q}{p}\right)^i$ when $p \neq q$, and
$u(i) = \lambda r_1^i + \mu i r_1^i = \lambda + \mu i$ when $p = q = \frac{1}{2}$. Taking
into account the boundary conditions, one can determine the values of $\lambda$ and $\mu$.
The result is, for $p \neq q$,

$$u(i) = \frac{1 - \left(\frac{q}{p}\right)^i}{1 - \left(\frac{q}{p}\right)^c},$$

and for $p = q = \frac{1}{2}$,

$$u(i) = \frac{i}{c}.$$

In the case $p = q = \frac{1}{2}$, the probability $v(i)$ that B wins when the initial fortune of B
is $c - i$ is obtained by replacing $i$ by $c - i$ in the expression for $u(i)$: $v(i) = \frac{c-i}{c} = 1 - \frac{i}{c}$.
One checks that $u(i) + v(i) = 1$, which means in particular that the probability
that the game lasts forever is null. The reader is invited to check that the same is
true in the case $p \neq q$.
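The closed-form expressions can be cross-checked against the recurrence itself. The sketch below solves $u(i) = p\,u(i+1) + q\,u(i-1)$ by a "shooting" argument (start from a trial value of $u(1)$ and rescale, which is legitimate because the equation is linear); this numerical route is an illustration, not the method of the text, and the values of $p$ and $c$ are arbitrary.

```python
from fractions import Fraction

def ruin_closed_form(i, c, p):
    """u(i) from the formulas above, for p != q and for p = q = 1/2."""
    q = 1 - p
    if p == q:
        return Fraction(i, c)
    r = q / p
    return (1 - r**i) / (1 - r**c)

def ruin_by_recurrence(c, p):
    """Solve u(i) = p u(i+1) + q u(i-1) with u(0) = 0, u(c) = 1."""
    q = 1 - p
    # Trial solution w with w(0) = 0, w(1) = 1, extended by
    # w(i+1) = (w(i) - q w(i-1)) / p; dividing by w(c) enforces u(c) = 1.
    w = [Fraction(0), Fraction(1)]
    for i in range(1, c):
        w.append((w[i] - q * w[i - 1]) / p)
    return [x / w[c] for x in w]

p, c = Fraction(2, 5), 10          # hypothetical unfair game, total capital 10
u = ruin_by_recurrence(c, p)
print([str(x) for x in u])
```

With exact fractions the two computations agree term by term, including in the symmetric case $p = \frac{1}{2}$.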


First-step analysis can also be used to compute average times before absorption 
(Exercise 9.7.5).


Communication and Period


These two concepts are topological in the sense that they concern only the naked 
transition graph (with only the arrows, without the labels). 
Definition 9.1.11 State $j$ is said to be accessible from state $i$ if there exists an
$M \ge 0$ such that $p_{ij}(M) > 0$. States $i$ and $j$ are said to communicate if $i$ is
accessible from $j$ and $j$ is accessible from $i$, and this is denoted by $i \leftrightarrow j$.


In particular, a state $i$ is always accessible from itself, since $p_{ii}(0) = 1$ ($P^0 = I$,
the identity).


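For $M \ge 1$, $p_{ij}(M) = \sum_{i_1, \ldots, i_{M-1}} p_{i i_1} \cdots p_{i_{M-1} j}$, and therefore $p_{ij}(M) > 0$ if and
only if there exists at least one path $i, i_1, \ldots, i_{M-1}, j$ from $i$ to $j$ such that

$$p_{i i_1} p_{i_1 i_2} \cdots p_{i_{M-1} j} > 0,$$

or, equivalently, if there is a directed path from $i$ to $j$ in the transition graph $G$.
Clearly,

$$i \leftrightarrow i \quad \text{(reflexivity)},$$
$$i \leftrightarrow j \Rightarrow j \leftrightarrow i \quad \text{(symmetry)},$$
$$i \leftrightarrow j,\ j \leftrightarrow k \Rightarrow i \leftrightarrow k \quad \text{(transitivity)}.$$

Therefore, the communication relation ($\leftrightarrow$) is an equivalence relation, and it
generates a partition of the state space $E$ into disjoint equivalence classes called
communication classes.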
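Because communication only depends on which entries of $P$ are positive, the classes of a finite chain can be found by graph search. This is a minimal sketch under that assumption; the four-state matrix is invented for the example.

```python
def reachable(P, i):
    """States accessible from i (i itself included, since p_ii(0) = 1)."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p_uv in enumerate(P[u]):
            if p_uv > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def communication_classes(P):
    """Partition of the state space into classes of mutually accessible states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = sorted(j for j in reach[i] if i in reach[j])
            classes.append(cls)
            assigned.update(cls)
    return classes

# Hypothetical chain: {0, 1} communicate, 2 leaks into the others, 3 is absorbing.
P = [[0.5, 0.5, 0, 0],
     [0.5, 0.5, 0, 0],
     [0.25, 0.25, 0.25, 0.25],
     [0, 0, 0, 1]]
classes = communication_classes(P)
print(classes)
```

A chain on a finite state space is irreducible exactly when this function returns a single class.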


Definition 9.1.12 A closed state $i$ is one such that $p_{ii} = 1$. More generally, a
closed set $C$ of states is one such that for all $i \in C$, $\sum_{j \in C} p_{ij} = 1$.


Definition 9.1.13 If there exists only one communication class, then the chain, 
its transition matrix, and its transition graph are said to be irreducible. 


EXAMPLE 9.1.14: THE REPAIR SHOP, TAKE 2. Recall that this Markov chain
satisfies the recurrence equation

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1}, \qquad (9.5)$$

where $a^+ = \max(a, 0)$. The sequence $\{Z_n\}_{n \ge 1}$ is assumed to be IID, independent
of the initial state $X_0$, and with common probability distribution

$$P(Z_1 = k) = a_k, \quad k \ge 0,$$

of generating function $g_Z$.


This chain is irreducible if and only if $P(Z_1 = 0) > 0$ and $P(Z_1 \ge 2) > 0$, as
we now prove formally. Looking at (9.5), we make the following observations. If
$P(Z_{n+1} = 0) = 0$, then $X_{n+1} \ge X_n$ a.s. and there is no way of going from $i$ to $i-1$.
If $P(Z_{n+1} \le 1) = 1$, then $X_{n+1} \le X_n$, and there is no way of going from $i$ to $i+1$.
Therefore, the two conditions $P(Z_1 = 0) > 0$ and $P(Z_1 \ge 2) > 0$ are necessary
for irreducibility. They are also sufficient. Indeed, if there exists an integer $k \ge 2$
such that $P(Z_{n+1} = k) > 0$, then one can jump with positive probability from any
$i > 0$ to $i + k - 1 > i$ or from $i = 0$ to $k > 0$. Also, if $P(Z_{n+1} = 0) > 0$, one can
step down from $i > 0$ to $i - 1$ with positive probability. In particular, one can go
from $i$ to $j < i$ with positive probability. Therefore, one way to travel from $i$ to
$j \ge i$ is by taking several successive steps of height at least $k - 1$ in order to reach
a state $l \ge j$, and then (in the case of $l > j$) stepping down one stair at a time
from $l$ to $j$. All this with positive probability.


Consider the random walk on $\mathbb{Z}$ (Example 9.1.4). Since $0 < p < 1$, it is
irreducible. Observe that $E = C_0 + C_1$, where $C_0$ and $C_1$, the set of even and odd
relative integers respectively, have the following property. If you start from $i \in C_0$
(resp., $C_1$), then in one step you can go only to a state $j \in C_1$ (resp., $C_0$). The
chain $\{X_n\}$ passes alternately from one cyclic class to the other. In this sense, the
chain has a periodic behavior, corresponding to the period 2. More generally, for
any irreducible Markov chain, one can find a unique partition of $E$ into $d$ classes
$C_0, C_1, \ldots, C_{d-1}$ such that for all $k$, $i \in C_k$,

$$\sum_{j \in C_{k+1}} p_{ij} = 1,$$

where by convention $C_d = C_0$, and where $d$ is maximal (that is, there is no other
such partition $C'_0, C'_1, \ldots, C'_{d'-1}$ with $d' > d$). The proof follows directly from
Theorem 9.1.17 below.


The number $d \ge 1$ is called the period of the chain (resp., of the transition
matrix, of the transition graph). The classes $C_0, C_1, \ldots, C_{d-1}$ are called the cyclic
classes. The chain therefore moves from one class to the other at each transition,
and this cyclically.


We now give the formal definition of period. It is based on the notion of greatest 
common divisor of a set of positive integers. 


Definition 9.1.15 The period $d_i$ of state $i \in E$ is, by definition,

$$d_i = \mathrm{GCD}\{n \ge 1;\ p_{ii}(n) > 0\},$$

with the convention $d_i = +\infty$ if there is no $n \ge 1$ with $p_{ii}(n) > 0$. If $d_i = 1$, the
state $i$ is called aperiodic.
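For a finite chain the period can be computed directly from this definition by tracking which states are reachable in exactly $n$ steps. The sketch below is an illustration with made-up matrices; the finite horizon of $3|E|$ steps is a choice of this example, justified in the code comment, and is not taken from the text.

```python
from math import gcd, inf

def period(P, i):
    """gcd of the n >= 1 with p_ii(n) > 0, for a finite state space."""
    n_states = len(P)
    # Horizon 3|E|: every simple cycle of the class of i can be inserted into a
    # closed walk through i of at most 3|E| steps, so later return times cannot
    # lower the gcd any further.
    horizon = 3 * n_states
    support = {i}          # states reachable from i in exactly n steps
    d = 0
    for n in range(1, horizon + 1):
        support = {v for u in support for v in range(n_states) if P[u][v] > 0}
        if i in support:
            d = gcd(d, n)
    return d if d > 0 else inf   # convention d_i = +infinity when there is no return

flip = [[0, 1], [1, 0]]                     # deterministic alternation
cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # deterministic 3-cycle
lazy = [[0.5, 0.5], [1, 0]]                 # a self-loop destroys periodicity
print(period(flip, 0), period(cycle3, 1), period(lazy, 1))
```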




Theorem 9.1.16 If states $i$ and $j$ communicate, then they have the same period.


Proof. As $i$ and $j$ communicate, there exist integers $N$ and $M$ such that $p_{ij}(M) > 0$
and $p_{ji}(N) > 0$. For any $k \ge 1$,

$$p_{ii}(M + nk + N) \ge p_{ij}(M)\,(p_{jj}(k))^n\, p_{ji}(N)$$

(indeed, the path $X_0 = i, X_M = j, X_{M+k} = j, \ldots, X_{M+nk} = j, X_{M+nk+N} = i$ is
just one way of going from $i$ to $i$ in $M + nk + N$ steps). Therefore, for any $k \ge 1$
such that $p_{jj}(k) > 0$, we have $p_{ii}(M + nk + N) > 0$ for all $n \ge 1$. Therefore, $d_i$
divides $M + nk + N$ for all $n \ge 1$, and in particular, $d_i$ divides $k$. We have therefore
shown that $d_i$ divides all $k$ such that $p_{jj}(k) > 0$, and in particular, $d_i$ divides $d_j$.
By symmetry, $d_j$ divides $d_i$, and therefore, finally, $d_i = d_j$.


We may therefore speak of the period of a communication class or of an irre- 
ducible chain. 


The important result concerning periodicity is the following. 


Theorem 9.1.17 Let $P$ be an irreducible stochastic matrix with period $d$. Then
for all states $i, j$ there exist $m \ge 0$ and $n_0 \ge 0$ ($m$ and $n_0$ possibly depending on
$i, j$) such that

$$p_{ij}(m + nd) > 0, \quad \text{for all } n \ge n_0.$$


Proof. It suffices to prove the theorem for $i = j$. Indeed, there exists an $m$
such that $p_{ij}(m) > 0$, because $j$ is accessible from $i$, the chain being irreducible,
and therefore, if for some $n_0 \ge 0$ we have $p_{jj}(nd) > 0$ for all $n \ge n_0$, then
$p_{ij}(m + nd) \ge p_{ij}(m)\, p_{jj}(nd) > 0$ for all $n \ge n_0$. The rest of the proof is an
immediate consequence of a classical result of number theory.¹ Indeed, the GCD
of the set $A = \{k \ge 1;\ p_{jj}(k) > 0\}$ is $d$, and $A$ is closed under addition. The set $A$
therefore contains all but a finite number of the positive multiples of $d$. In other
words, there exists an $n_0$ such that $n \ge n_0$ implies $p_{jj}(nd) > 0$.


Behavior of a Markov chain with period 3 


¹ Let $d$ be the GCD of $A = \{a_n;\ n \ge 1\}$, a set of positive integers that is closed under addition.
Then $A$ contains all but a finite number of the positive multiples of $d$.


Stationary Distributions


The central notion of the stability theory of discrete-time HMCs is that of a sta- 
tionary distribution. 


Definition 9.1.18 A probability distribution $\pi$ satisfying

$$\pi^T = \pi^T P \qquad (9.6)$$

is called a stationary distribution of the transition matrix $P$, or of the corresponding
HMC.


The global balance equation (9.6) says that for all states $i$,

$$\pi(i) = \sum_{j \in E} \pi(j)\, p_{ji}.$$


Iteration of (9.6) gives $\pi^T = \pi^T P^n$ for all $n \ge 0$, and therefore, in view of (9.1), if
the initial distribution $\nu = \pi$, then $\nu_n = \pi$ for all $n \ge 0$. Thus, if a chain is started
with a stationary distribution, it keeps the same distribution forever. But there is
more, because then,

$$P(X_n = i_0, X_{n+1} = i_1, \ldots, X_{n+k} = i_k) = P(X_n = i_0)\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k}
= \pi(i_0)\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k}$$

does not depend on $n$. In this sense the chain is stationary. One also says that
the chain is in a stationary regime, or in steady state. In summary:


Theorem 9.1.19 An HMC whose initial distribution is a stationary distribution 
is stationary. 


The balance equation $\pi^T P = \pi^T$, together with the requirement that $\pi$ be a
probability vector, that is, $\pi^T 1 = 1$ (where $1$ is a column vector with all its entries
equal to 1), constitute, when $E$ is finite, $|E| + 1$ equations for $|E|$ unknown variables.
One of the $|E|$ equations in $\pi^T P = \pi^T$ is superfluous given the constraint $\pi^T 1 = 1$.
Indeed, summing up all equalities of $\pi^T P = \pi^T$ yields the equality $\pi^T P 1 = \pi^T 1$,
that is, $\pi^T 1 = 1$.


EXAMPLE 9.1.20: TWO-STATE MARKOV CHAIN. Take $E = \{1, 2\}$ and define
the transition matrix

$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$

where $\alpha, \beta \in (0,1)$. The global balance equations are

$$\pi(1) = \pi(1)(1 - \alpha) + \pi(2)\beta, \qquad \pi(2) = \pi(1)\alpha + \pi(2)(1 - \beta).$$

These two equations are dependent and reduce to the single equation $\pi(1)\alpha = \pi(2)\beta$,
to which must be added the constraint $\pi(1) + \pi(2) = 1$ expressing that $\pi$
is a probability vector. We obtain

$$\pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}.$$
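For a finite chain, the recipe "drop one balance equation and add the normalization" is a plain linear solve. The following is a sketch under the assumption that the stationary distribution is unique (otherwise the elimination below fails to find a pivot); the solver and the values $\alpha = \frac{1}{3}$, $\beta = \frac{1}{4}$ are illustrative choices.

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve pi^T P = pi^T together with sum(pi) = 1, in exact arithmetic."""
    n = len(P)
    # Row i encodes sum_j pi(j) p_ji - pi(i) = 0.
    A = [[Fraction(P[j][i]) - int(i == j) for j in range(n)] for i in range(n)]
    b = [Fraction(0)] * n
    # The last balance equation is superfluous: replace it by the normalization.
    A[n - 1] = [Fraction(1)] * n
    b[n - 1] = Fraction(1)
    # Gauss-Jordan elimination.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [x * inv for x in A[col]]
        b[col] = b[col] * inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
                b[r] = b[r] - factor * b[col]
    return b

alpha, beta = Fraction(1, 3), Fraction(1, 4)   # hypothetical parameters
P = [[1 - alpha, alpha], [beta, 1 - beta]]
pi = stationary_distribution(P)
print([str(x) for x in pi])
```

The output can be compared with $\beta/(\alpha+\beta)$ and $\alpha/(\alpha+\beta)$ from the example.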


EXAMPLE 9.1.21: THE EHRENFEST URN, TAKE 2. The global balance equations
are, for $i \in [1, N-1]$,

$$\pi(i) = \pi(i-1)\left(1 - \frac{i-1}{N}\right) + \pi(i+1)\,\frac{i+1}{N},$$

and, for the boundary states,

$$\pi(0) = \pi(1)\,\frac{1}{N}, \qquad \pi(N) = \pi(N-1)\,\frac{1}{N}.$$

Leaving $\pi(0)$ undetermined, one can solve the balance equations for $i = 0, 1, \ldots, N$
successively, to obtain $\pi(i) = \pi(0) \binom{N}{i}$. The value of $\pi(0)$ is then determined by
writing that $\pi$ is a probability vector:
$1 = \sum_{i=0}^{N} \pi(i) = \pi(0) \sum_{i=0}^{N} \binom{N}{i} = \pi(0)\, 2^N$.


This gives for $\pi$ the binomial distribution of size $N$ and parameter $\frac{1}{2}$:

$$\pi(i) = \frac{1}{2^N} \binom{N}{i}.$$

This is the distribution one would obtain by placing independently each particle
in the compartments, with probability $\frac{1}{2}$ for each compartment.
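The claim that the binomial distribution solves the global balance equations can be verified mechanically for a given $N$. This is a small sketch with $N = 6$ chosen arbitrarily; `ehrenfest_matrix` is a hypothetical helper built from the transition probabilities $p_{i,i+1} = \frac{N-i}{N}$, $p_{i,i-1} = \frac{i}{N}$.

```python
from fractions import Fraction
from math import comb

def ehrenfest_matrix(N):
    """Transition matrix on {0, ..., N} with p_{i,i+1} = (N-i)/N and p_{i,i-1} = i/N."""
    P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i < N:
            P[i][i + 1] = Fraction(N - i, N)
        if i > 0:
            P[i][i - 1] = Fraction(i, N)
    return P

N = 6
P = ehrenfest_matrix(N)
pi = [Fraction(comb(N, i), 2**N) for i in range(N + 1)]   # binomial(N, 1/2)
pi_P = [sum(pi[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
print(pi_P == pi)
```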


Stationary distributions may be many. Take the identity as transition matrix. 
Then any probability distribution on the state space is a stationary distribution. 
Also there may well not exist any stationary distribution. See Exercise 9.7.10. 


Reversible Chains 


Let $\{X_n\}_{n \ge 0}$ be an HMC with transition matrix $P$ and admitting a stationary
distribution $\pi > 0$ (meaning $\pi(i) > 0$ for all states $i$). Define the matrix $Q$,
indexed by $E$, by

$$\pi(i)\, q_{ij} = \pi(j)\, p_{ji}. \qquad (9.7)$$

This is a stochastic matrix since

$$\sum_{j \in E} q_{ij} = \sum_{j \in E} \frac{\pi(j)}{\pi(i)}\, p_{ji} = \frac{1}{\pi(i)} \sum_{j \in E} \pi(j)\, p_{ji} = \frac{\pi(i)}{\pi(i)} = 1,$$

where the third equality uses the global balance equations. Its interpretation is
the following: Suppose that the initial distribution of the chain is $\pi$, in which case
for all $n \ge 0$, all $i \in E$, $P(X_n = i) = \pi(i)$. Then, from Bayes' retrodiction formula,

$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_{n+1} = i \mid X_n = j)\, P(X_n = j)}{P(X_{n+1} = i)},$$

that is, in view of (9.7),

$$P(X_n = j \mid X_{n+1} = i) = q_{ij}.$$

We see that $Q$ is the transition matrix of the initial chain when time is reversed.


The following is a very simple observation that will be promoted to the rank 
of a theorem in view of its usefulness. 


Theorem 9.1.22 Let $P$ be a stochastic matrix indexed by a countable set $E$, and
let $\pi$ be a probability distribution on $E$. Define the matrix $Q$ indexed by $E$ by (9.7).
If $Q$ is a stochastic matrix, then $\pi$ is a stationary distribution of $P$.


Proof. For fixed $i \in E$, sum equalities (9.7) with respect to $j \in E$ to obtain

$$\sum_{j \in E} \pi(i)\, q_{ij} = \sum_{j \in E} \pi(j)\, p_{ji}.$$

This is the global balance equation since the left-hand side is equal to

$$\pi(i) \sum_{j \in E} q_{ij} = \pi(i).$$


Definition 9.1.23 One calls reversible a stationary Markov chain with initial
distribution $\pi$ (a stationary distribution) if for all $i, j \in E$, we have the so-called
detailed balance equations

$$\pi(i)\, p_{ij} = \pi(j)\, p_{ji}. \qquad (9.8)$$

We then say: the pair $(P, \pi)$ is reversible.


In this case, $q_{ij} = p_{ij}$, and therefore the chain and the time-reversed chain are
statistically the same, since the distribution of a homogeneous Markov chain is
entirely determined by its initial distribution and its transition matrix.


The next result is an immediate corollary of Theorem 9.1.22.


Theorem 9.1.24 Let $P$ be a transition matrix on the countable state space $E$,
and let $\pi$ be some probability distribution on $E$. If for all $i, j \in E$, the detailed
balance equations (9.8) are satisfied, then $\pi$ is a stationary distribution of $P$.


EXAMPLE 9.1.25: THE EHRENFEST URN, TAKE 3. The verification of the
detailed balance equations $\pi(i)\, p_{i,i+1} = \pi(i+1)\, p_{i+1,i}$ is immediate.
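The difference between global and detailed balance is easy to see on small matrices. In this sketch both chains are invented for illustration: a three-state chain that prefers to move in the direction 0, 1, 2, 0 (stationary but not reversible under the uniform distribution), and a two-state chain (always reversible).

```python
from fractions import Fraction

def time_reversal(P, pi):
    """Matrix Q defined by pi(i) q_ij = pi(j) p_ji, assuming pi(i) > 0 for all i."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]

def satisfies_detailed_balance(P, pi):
    """Check pi(i) p_ij = pi(j) p_ji for all pairs of states."""
    n = len(P)
    return all(pi[i] * P[i][j] == pi[j] * P[j][i] for i in range(n) for j in range(n))

# Hypothetical chain with a preferred direction of rotation.
P_cyclic = [[0, Fraction(3, 4), Fraction(1, 4)],
            [Fraction(1, 4), 0, Fraction(3, 4)],
            [Fraction(3, 4), Fraction(1, 4), 0]]
uniform = [Fraction(1, 3)] * 3
Q = time_reversal(P_cyclic, uniform)   # rotates mostly in the opposite direction

# Hypothetical two-state chain with alpha = 1/3, beta = 1/4.
P2 = [[Fraction(2, 3), Fraction(1, 3)], [Fraction(1, 4), Fraction(3, 4)]]
pi2 = [Fraction(3, 7), Fraction(4, 7)]
print(satisfies_detailed_balance(P_cyclic, uniform), satisfies_detailed_balance(P2, pi2))
```

For the first chain $Q$ is the transpose of $P$, which is stochastic, so the uniform distribution is stationary by Theorem 9.1.22 even though (9.8) fails.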


The Strong Markov Property 


The Markov property, that is, the independence of past and future given the 
present state, extends to the situation where the present time is a stopping time, 
a notion which we now introduce. 


Let $\{X_n\}_{n \ge 0}$ be a stochastic process with values in the denumerable set $E$. For
an event $A$, the notation $A \in \mathcal{X}_0^n$ means that there exists a function
$\varphi : E^{n+1} \to \{0, 1\}$ such that

$$1_A(\omega) = \varphi(X_0(\omega), \ldots, X_n(\omega)).$$

In other terms, this event is expressible in terms of $X_0(\omega), \ldots, X_n(\omega)$. Let now $\tau$
be a random variable with values in $\mathbb{N}$. It is called an $\mathcal{X}_0^\infty$-stopping time if for all
$m \in \mathbb{N}$, $\{\tau = m\} \in \mathcal{X}_0^m$. In other words, it is a non-anticipative random time
with respect to $\{X_n\}_{n \ge 0}$, since in order to check if $\tau = m$, one need only observe
the process up to time $m$ and not beyond. It is immediate to check that if $\tau$ is an
$\mathcal{X}_0^\infty$-stopping time, then so is $\tau + n$ for all $n \ge 1$.


EXAMPLE 9.1.26: RETURN TIME. Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$.
Define for $i \in E$ the return time to $i$ by

$$T_i := \inf\{n \ge 1;\ X_n = i\},$$

using the convention $\inf \emptyset = \infty$ for the empty set of $\mathbb{N}$. This is an $\mathcal{X}_0^\infty$-stopping
time since for all $m \in \mathbb{N}$,

$$\{T_i = m\} = \{X_1 \neq i, X_2 \neq i, \ldots, X_{m-1} \neq i, X_m = i\}.$$

Note that $T_i \ge 1$. It is a “return” time, not to be confused with the closely
related “hitting” time of $i$, defined as $S_i := \inf\{n \ge 0;\ X_n = i\}$, which is also an
$\mathcal{X}_0^\infty$-stopping time, equal to $T_i$ if and only if $X_0 \neq i$.
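The distinction between the return time and the hitting time is only visible when the path starts in the target state. A minimal sketch on a short, made-up path follows; on a finite observed path, the value `inf` only means "not seen within the observation window".

```python
from math import inf

def return_time(path, i):
    """T_i = inf{n >= 1 : X_n = i}, evaluated on an observed finite path."""
    return next((n for n in range(1, len(path)) if path[n] == i), inf)

def hitting_time(path, i):
    """S_i = inf{n >= 0 : X_n = i}, evaluated on an observed finite path."""
    return next((n for n in range(len(path)) if path[n] == i), inf)

path = [2, 0, 1, 2, 0]   # hypothetical observed values X_0, ..., X_4
print(return_time(path, 2), hitting_time(path, 2))   # differ because X_0 = 2
print(return_time(path, 0), hitting_time(path, 0))   # agree because X_0 != 0
```

Both functions scan the path from the left and stop at the first match, which mirrors the non-anticipative character of a stopping time: deciding whether the time equals $m$ uses only $X_0, \ldots, X_m$.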
EXAMPLE 9.1.27: SUCCESSIVE RETURN TIMES. This continues the previous
example. Let us fix a state, conventionally labeled 0, and let $T_0$ be the return time
to 0. We define the successive return times to 0, $\tau_k$, $k \ge 1$, by $\tau_1 = T_0$ and for
$k \ge 1$,

$$\tau_{k+1} = \inf\{n \ge \tau_k + 1;\ X_n = 0\},$$

with the above convention that $\inf \emptyset = \infty$. In particular, if $\tau_k = \infty$ for some $k$,
then $\tau_{k+\ell} = \infty$ for all $\ell \ge 1$. The identity

$$\{\tau_k = m\} = \left\{ \sum_{n=1}^{m-1} 1_{\{X_n = 0\}} = k - 1,\ X_m = 0 \right\}$$

for $m \ge 1$ shows that $\tau_k$ is an $\mathcal{X}_0^\infty$-stopping time.


Let $\{X_n\}_{n \ge 0}$ be a stochastic process with values in the countable set $E$ and let
$\tau$ be a random time taking its values in $\overline{\mathbb{N}} := \mathbb{N} \cup \{+\infty\}$. In order to define $X_\tau$
when $\tau = \infty$, one must decide how to define $X_\infty$. This is done by taking some
arbitrary element $\Delta$ not in $E$, and setting

$$X_\infty = \Delta.$$

By definition, the “process after $\tau$” is the stochastic process

$$\{S_\tau X_n\}_{n \ge 0} := \{X_{n+\tau}\}_{n \ge 0}.$$

The “process before $\tau$,” or the “process stopped at $\tau$,” is the process

$$\{X_n^\tau\}_{n \ge 0} := \{X_{n \wedge \tau}\}_{n \ge 0},$$

which freezes at time $\tau$ at the value $X_\tau$.
Theorem 9.1.28 Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$ and transition
matrix $P$. Let $\tau$ be an $\mathcal{X}_0^\infty$-stopping time. Then for any state $i \in E$,

($\alpha$) Given that $X_\tau = i$, the process after $\tau$ and the process before $\tau$ are independent.

($\beta$) Given that $X_\tau = i$, the process after $\tau$ is an HMC with transition matrix $P$.


Proof. ($\alpha$) We have to show that for all times $k \ge 1$, $n \ge 0$, and all states
$i_0, \ldots, i_n, i, j_1, \ldots, j_k$,

$$P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i, X_{\tau \wedge 0} = i_0, \ldots, X_{\tau \wedge n} = i_n)
= P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i).$$

We shall prove a simplified version of the above equality, namely

$$P(X_{\tau+k} = j \mid X_\tau = i, X_{\tau \wedge n} = i_n) = P(X_{\tau+k} = j \mid X_\tau = i). \qquad (\star)$$

The general case is obtained by the same arguments. The left-hand side of $(\star)$
equals

$$\frac{P(X_{\tau+k} = j, X_\tau = i, X_{\tau \wedge n} = i_n)}{P(X_\tau = i, X_{\tau \wedge n} = i_n)}.$$

The numerator of the above expression can be developed as

$$\sum_{r \in \mathbb{N}} P(\tau = r, X_{r+k} = j, X_r = i, X_{r \wedge n} = i_n). \qquad (\star\star)$$

(The sum is over $\mathbb{N}$ because $X_\tau = i \neq \Delta$ implies that $\tau < \infty$.) But

$$P(\tau = r, X_{r+k} = j, X_r = i, X_{r \wedge n} = i_n)
= P(X_{r+k} = j \mid X_r = i, X_{r \wedge n} = i_n, \tau = r)\, P(\tau = r, X_{r \wedge n} = i_n, X_r = i),$$

and since $r \wedge n \le r$ and $\{\tau = r\} \in \mathcal{X}_0^r$, the event $B := \{X_{r \wedge n} = i_n, \tau = r\}$ is in
$\mathcal{X}_0^r$. Therefore, by the Markov property,
$P(X_{r+k} = j \mid X_r = i, X_{r \wedge n} = i_n, \tau = r) = P(X_{r+k} = j \mid X_r = i) = p_{ij}(k)$.
Finally, expression $(\star\star)$ reduces to

$$\sum_{r \in \mathbb{N}} p_{ij}(k)\, P(\tau = r, X_{r \wedge n} = i_n, X_r = i) = p_{ij}(k)\, P(X_\tau = i, X_{\tau \wedge n} = i_n).$$

Therefore, the left-hand side of $(\star)$ is just $p_{ij}(k)$. Similar computations show that
the right-hand side of $(\star)$ is also $p_{ij}(k)$, so that ($\alpha$) is proven.


($\beta$) We must show that for all states $i, j, k, i_{n-1}, \ldots, i_1$,

$$P(X_{\tau+n+1} = k \mid X_{\tau+n} = j, X_{\tau+n-1} = i_{n-1}, \ldots, X_\tau = i)
= P(X_{\tau+n+1} = k \mid X_{\tau+n} = j) = p_{jk}.$$

But the first equality follows from the fact proven in ($\alpha$) that for the stopping time
$\tau' = \tau + n$, the processes before and after $\tau'$ are independent given $X_{\tau'} = j$. The
second equality is obtained by the same calculations as in the proof of ($\alpha$).


The Cycle Independence Property 


Consider a Markov chain with a state conventionally denoted by 0 such that
$P_0(T_0 < \infty) = 1$. In view of the strong Markov property, the chain starting
from state 0 will return infinitely often to this state. Let $\tau_1 = T_0, \tau_2, \ldots$ be the
successive return times to 0, and set $\tau_0 = 0$.


By the strong Markov property, for any $k \ge 1$, the process after $\tau_k$ is independent
of the process before $\tau_k$ (observe that condition $X_{\tau_k} = 0$ is always satisfied),
and the process after $\tau_k$ is a Markov chain with the same transition matrix as the
original chain, and with initial state 0, by construction. Therefore, the pieces of
trajectory between the successive times of visit to 0,

$$\{X_{\tau_k}, X_{\tau_k + 1}, \ldots, X_{\tau_{k+1} - 1}\}, \quad k \ge 0,$$

are independent and identically distributed. Such pieces are called the regenerative
cycles of the chain between visits to state 0. Each random time $\tau_k$ is a regeneration
time, in the sense that $\{X_{\tau_k + n}\}_{n \ge 0}$ is independent of the past $X_0, \ldots, X_{\tau_k - 1}$ and
has the same distribution as $\{X_n\}_{n \ge 0}$. In particular, the sequence $\{\tau_k - \tau_{k-1}\}_{k \ge 1}$
is IID.


EXAMPLE 9.1.29: RETURNS TO ZERO OF THE 1-D SYMMETRIC WALK. Let
$\tau_1 = T_0, \tau_2, \ldots$ be the successive return times to state 0 of the random walk on $\mathbb{Z}$
of Example 9.1.4 with $p = \frac{1}{2}$. We shall admit that $P_0(T_0 < \infty) = 1$, a fact that
will be proved in the next section, and obtain the probability distribution of $T_0$
given $X_0 = 0$.


Observe that for $n \ge 1$,

$$P_0(X_{2n} = 0) = \sum_{k \ge 1} P_0(\tau_k = 2n),$$

and therefore, for all $z \in \mathbb{C}$ such that $|z| < 1$,

$$\sum_{n \ge 1} P_0(X_{2n} = 0)\, z^{2n} = \sum_{k \ge 1} \sum_{n \ge 1} P_0(\tau_k = 2n)\, z^{2n} = \sum_{k \ge 1} E_0[z^{\tau_k}].$$

But $\tau_k = \tau_1 + (\tau_2 - \tau_1) + \cdots + (\tau_k - \tau_{k-1})$ and therefore, since $\tau_1 = T_0$,

$$E_0[z^{\tau_k}] = (E_0[z^{T_0}])^k.$$

In particular,

$$\sum_{n \ge 0} P_0(X_{2n} = 0)\, z^{2n} = \frac{1}{1 - E_0[z^{T_0}]}$$

(note that the latter sum includes the term for $n = 0$, that is, 1). Direct evaluation
of the left-hand side yields

$$\sum_{n \ge 0} \frac{1}{2^{2n}} \frac{(2n)!}{n!\,n!}\, z^{2n} = \frac{1}{\sqrt{1 - z^2}}.$$

Therefore, the generating function of the return time to 0 given $X_0 = 0$ is

$$E_0[z^{T_0}] = 1 - \sqrt{1 - z^2}.$$

Its first derivative

$$\frac{z}{\sqrt{1 - z^2}}$$

tends to $\infty$ as $z \to 1$ from below via real values. Therefore, by Abel's theorem,

$$E_0[T_0] = \infty.$$

We see that although given $X_0 = 0$ the return time is almost surely finite, it has
an infinite expectation.
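The generating-function identity can be unfolded into a recursion for the first-return probabilities $f_n := P_0(T_0 = 2n)$: with $u_n := P_0(X_{2n} = 0)$, it is equivalent to $u_n = \sum_{k=1}^{n} f_k\, u_{n-k}$ for $n \ge 1$. The sketch below solves this recursion exactly and compares the result with the coefficients of $1 - \sqrt{1 - z^2}$, namely $\binom{2n}{n} / ((2n-1)\,4^n)$; the notation $f_n$, $u_n$ and the cutoff 30 are choices of this example.

```python
from fractions import Fraction
from math import comb

def u(n):
    """u_n = P_0(X_{2n} = 0) for the symmetric walk."""
    return Fraction(comb(2 * n, n), 4**n)

def first_return_probs(n_max):
    """f_n = P_0(T_0 = 2n), n = 0..n_max, from u_n = sum_{k=1}^{n} f_k u_{n-k}."""
    f = [Fraction(0)]
    for n in range(1, n_max + 1):
        f.append(u(n) - sum(f[k] * u(n - k) for k in range(1, n)))
    return f

def closed_form(n):
    """Coefficient of z^{2n} in 1 - sqrt(1 - z^2)."""
    return Fraction(comb(2 * n, n), (2 * n - 1) * 4**n)

f = first_return_probs(30)
print(str(f[1]), str(f[2]), float(sum(f)))
print(float(sum(2 * n * f[n] for n in range(1, 31))))   # truncated mean return time
```

The partial sums of $f_n$ approach 1 only slowly, while the truncated mean keeps growing as the cutoff increases, in line with $E_0[T_0] = \infty$.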


9.2 Recurrence 


In the theory of Markov chains, recurrence refers to the possibility of an infinite 
number of visits to a given state. The basic definition is in terms of return times. 


Recall that $T_i$ denotes the return time to state $i$.


Definition 9.2.1 State $i \in E$ is called recurrent if

$$P_i(T_i < \infty) = 1,$$

and otherwise it is called transient. A recurrent state $i \in E$ such that

$$E_i[T_i] < \infty$$

is called positive recurrent, and otherwise it is called null recurrent.


The definition in terms of return times will now be connected to that in terms 
of the number of visits. 


Theorem 9.2.2 The distribution given $X_0 = j$ of $N_i = \sum_{n \ge 1} 1_{\{X_n = i\}}$, the number
of visits to state $i$ strictly after time 0, is

$$P_j(N_i = r) = f_{ji}\, f_{ii}^{\,r-1} (1 - f_{ii}) \quad (r \ge 1),$$
$$P_j(N_i = 0) = 1 - f_{ji},$$

where $f_{ji} = P_j(T_i < \infty)$ and $T_i$ is the return time to $i$.


Proof. We first go from $j$ to $i$ (probability $f_{ji}$) and then, $r - 1$ times in succession,
from $i$ to $i$ (each time with probability $f_{ii}$), and the last time, that is the $(r+1)$-st
time, we leave $i$ never to return to it (probability $1 - f_{ii}$). By the cycle independence
property, all these “cycles” are independent, so that the successive probabilities
multiply.


The distribution of $N_i$ given $X_0 = j$ and given $N_i \ge 1$ is geometric. This has
two main consequences. Firstly, $P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1$. In words:
starting from $i$, the chain almost surely returns to $i$, and will then visit $i$ infinitely
often. Secondly,

$$E_i[N_i] = \sum_{r \ge 1} r\, f_{ii}^{\,r} (1 - f_{ii}) = \frac{f_{ii}}{1 - f_{ii}}.$$

In particular, $P_i(T_i < \infty) < 1 \iff E_i[N_i] < \infty$.


We collect these results for future reference. For any state $i \in E$,

$$P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1$$

and

$$P_i(T_i < \infty) < 1 \iff P_i(N_i = \infty) = 0 \iff E_i[N_i] < \infty. \qquad (9.9)$$

In particular, the event $\{N_i = \infty\}$ has $P_i$-probability 0 or 1.


The Potential Matrix Criterion 


The potential matrix $G$ associated with the transition matrix $P$ is defined by

$$G := \sum_{n \ge 0} P^n.$$

Its general term

$$g_{ij} = \sum_{n=0}^{\infty} p_{ij}(n) = \sum_{n=0}^{\infty} P_i(X_n = j) = \sum_{n=0}^{\infty} E_i\left[1_{\{X_n = j\}}\right] = E_i\left[\sum_{n=0}^{\infty} 1_{\{X_n = j\}}\right]$$

is the average number of visits to state $j$, given that the chain starts from state $i$.


Although the next criterion of recurrence is of theoretical rather than practical
interest, it can be helpful in a few situations, for instance in the study of recurrence
of random walks (see the examples below).


Theorem 9.2.3 State $i \in E$ is recurrent if and only if

$$\sum_{n=0}^{\infty} p_{ii}(n) = \infty.$$
Proof. This merely rephrases Eqn. (9.9).


EXAMPLE 9.2.4: 1-D RANDOM WALK. The state space of this Markov chain is
$E := \mathbb{Z}$ and the non-null terms of its transition matrix are $p_{i,i+1} = p$, $p_{i,i-1} = 1 - p$,
where $p \in (0,1)$. Since this chain is irreducible, it suffices to elucidate the nature
(recurrent or transient) of any one of its states, say, 0. We have $p_{00}(2n+1) = 0$
and

$$p_{00}(2n) = \frac{(2n)!}{n!\,n!}\, p^n (1-p)^n.$$

By Stirling's equivalence formula $n! \sim (n/e)^n \sqrt{2\pi n}$, the above quantity is equivalent to

$$\frac{[4p(1-p)]^n}{\sqrt{\pi n}}, \qquad (\star)$$

and the nature of the series $\sum_{n=0}^{\infty} p_{00}(n)$ (convergent or divergent) is that of the
series with general term $(\star)$. If $p \neq \frac{1}{2}$, in which case $4p(1-p) < 1$, the latter series
converges, and if $p = \frac{1}{2}$, in which case $4p(1-p) = 1$, it diverges. In summary, the
states of the 1-D random walk are transient if $p \neq \frac{1}{2}$, recurrent if $p = \frac{1}{2}$.
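The dichotomy can be observed on partial sums of the series $\sum_n p_{00}(2n)$. The following sketch is a numerical illustration only: a finite computation cannot prove divergence, and the cutoffs and the value $p = 0.6$ are arbitrary. For $p \neq \frac{1}{2}$ the limit is known in closed form, since $\sum_{n \ge 0} \binom{2n}{n} x^n = (1 - 4x)^{-1/2}$ for $|x| < \frac{1}{4}$, which gives $1/|2p - 1|$ at $x = p(1-p)$.

```python
def partial_potential(p, n_max):
    """Sum over n = 0..n_max of p_00(2n) = C(2n, n) (p (1 - p))^n."""
    pq = p * (1 - p)
    term, total = 1.0, 1.0          # the n = 0 term is p_00(0) = 1
    for n in range(1, n_max + 1):
        # C(2n, n) / C(2n - 2, n - 1) = 2 (2n - 1) / n
        term *= 2 * (2 * n - 1) / n * pq
        total += term
    return total

biased = partial_potential(0.6, 2000)      # settles near 1 / |2p - 1| = 5
fair_small = partial_potential(0.5, 2000)  # keeps growing, roughly like sqrt(n)
fair_large = partial_potential(0.5, 8000)
print(biased, fair_small, fair_large)
```

Quadrupling the cutoff roughly doubles the partial sum in the symmetric case, which is the growth predicted by the general term $(\pi n)^{-1/2}$.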


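EXAMPLE 9.2.5: 3-D RANDOM WALK. The state space of this HMC is $E = \mathbb{Z}^3$.
Denoting by $e_1$, $e_2$, and $e_3$ the canonical basis vectors of $\mathbb{R}^3$
(respectively $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$), the non-null terms of the
transition matrix of the 3-D symmetric random walk are given by

$$p_{x,\,x \pm e_i} = \frac{1}{6} .$$

We elucidate the nature of state, say, $0 = (0,0,0)$. Clearly, $p_{00}(2n+1) = 0$
for all $n \ge 0$, and (exercise)

$$p_{00}(2n) = \sum_{0 \le i+j \le n} \frac{(2n)!}{\left(i!\,j!\,(n-i-j)!\right)^2} \left(\frac{1}{6}\right)^{2n} .$$

This can be rewritten as

$$p_{00}(2n) = \sum_{0 \le i+j \le n} \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac{n!}{i!\,j!\,(n-i-j)!}\right)^2 \left(\frac{1}{3}\right)^{2n} .$$

Using the trinomial formula

$$\sum_{0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!} \left(\frac{1}{3}\right)^{n} = 1 ,$$

we obtain the bound

$$p_{00}(2n) \le K_n\, \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac{1}{3}\right)^{n} ,$$

where

$$K_n = \max_{0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!} .$$

For large values of $n$, $K_n$ is bounded as follows. Let $i_0$ and $j_0$ be the
values of $i$, $j$ that maximize $n!/(i!\,j!\,(n-i-j)!)$ in the domain of interest
$0 \le i+j \le n$. From the definition of $i_0$ and $j_0$, the quantities

$$\frac{n!}{(i_0-1)!\,j_0!\,(n-i_0-j_0+1)!}, \quad
\frac{n!}{(i_0+1)!\,j_0!\,(n-i_0-j_0-1)!}, \quad
\frac{n!}{i_0!\,(j_0-1)!\,(n-i_0-j_0+1)!}, \quad
\frac{n!}{i_0!\,(j_0+1)!\,(n-i_0-j_0-1)!}$$

are bounded by

$$\frac{n!}{i_0!\,j_0!\,(n-i_0-j_0)!} .$$

The corresponding inequalities reduce to

$$n - i_0 - 1 \le 2j_0 \le n - i_0 + 1 \quad \text{and} \quad n - j_0 - 1 \le 2i_0 \le n - j_0 + 1 ,$$

and this shows that for large $n$, $i_0 \sim n/3$ and $j_0 \sim n/3$. Therefore,
for large $n$, $p_{00}(2n)$ is bounded above by a quantity equivalent to

$$\frac{n!}{\left((n/3)!\right)^3\, 2^{2n}\, 3^{n}} \binom{2n}{n} .$$

By Stirling's equivalence formula, the latter quantity is in turn equivalent to

$$\frac{3\sqrt{3}}{2(\pi n)^{3/2}} ,$$

the general term of a convergent series. State 0 is therefore transient.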
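
The 3-D return probabilities can also be evaluated exactly for moderate $n$. The sketch below (an illustrative addition; the function names are ours) sums the multinomial expression for $p_{00}(2n)$ with exact rational arithmetic and prints it next to the quantity $3\sqrt{3}/(2(\pi n)^{3/2})$. Under the standard local limit result for the simple random walk, which is not proved in this text, the exact value is expected to be close to one half of that quantity for large $n$.

```python
from fractions import Fraction
from math import factorial, pi, sqrt

def p00_3d(n: int) -> Fraction:
    """Exact p_00(2n): sum over i + j <= n of (2n)! / (i! j! (n-i-j)!)^2, times 6^(-2n)."""
    paths = 0
    for i in range(n + 1):
        for j in range(n - i + 1):
            k = n - i - j
            # Number of closed paths with i steps along +e1, j along +e2, k along +e3
            # (and as many steps in the opposite directions).
            paths += factorial(2 * n) // (factorial(i) * factorial(j) * factorial(k)) ** 2
    return Fraction(paths, 6 ** (2 * n))

def stirling_bound(n: int) -> float:
    """The quantity 3*sqrt(3) / (2 * (pi*n)^(3/2)) appearing in the example."""
    return 3 * sqrt(3) / (2 * (pi * n) ** 1.5)

for n in (1, 10, 40):
    print(n, float(p00_3d(n)), stirling_bound(n))
```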


One might wonder at this point about the symmetric random walk on $\mathbb{Z}^2$,
which moves at each step northward, southward, eastward and westward equiprobably.
Exercise 9.7.25 asks you to show that it is null recurrent. Exercise 9.7.26 asks you
to prove that the symmetric random walks on $\mathbb{Z}^p$, $p \ge 4$, are transient.

A theoretical application of the potential matrix criterion is to the proof that
recurrence is a (communication) class property.


Theorem 9.2.6 If $i$ and $j$ communicate, then they are either both recurrent or
both transient.


Proof. By definition, $i$ and $j$ communicate if and only if there exist integers $M$
and $N$ such that $p_{ij}(M) > 0$ and $p_{ji}(N) > 0$. Going from $i$ to $j$ in $M$
steps, then from $j$ to $j$ in $n$ steps, then from $j$ to $i$ in $N$ steps, is just
one way of going from $i$ back to $i$ in $M+n+N$ steps. Therefore,
$p_{ii}(M+n+N) \ge p_{ij}(M)\, p_{jj}(n)\, p_{ji}(N)$. Similarly,
$p_{jj}(N+n+M) \ge p_{ji}(N)\, p_{ii}(n)\, p_{ij}(M)$. Therefore, with
$\alpha := p_{ij}(M)\, p_{ji}(N)$ (a strictly positive quantity), we have
$p_{ii}(M+N+n) \ge \alpha\, p_{jj}(n)$ and $p_{jj}(M+N+n) \ge \alpha\, p_{ii}(n)$.
This implies that the series $\sum_{n=0}^{\infty} p_{ii}(n)$ and
$\sum_{n=0}^{\infty} p_{jj}(n)$ either both converge or both diverge. The potential
matrix criterion concludes the proof.


Invariant Measure 
This notion extends that of a stationary distribution and plays a central role in
the recurrence theory of Markov chains.


Definition 9.2.7 A non-trivial (that is, non-null) vector $x$ (indexed by $E$) of
non-negative real numbers (notation: $0 \le x < \infty$) is called an invariant
measure of the stochastic matrix $P$ (indexed by $E$) if

$$x^T = x^T P . \qquad (9.10)$$


Theorem 9.2.8 Let $P$ be the transition matrix of an irreducible recurrent HMC
$\{X_n\}_{n \ge 0}$. Let 0 be an arbitrary state and let $T_0$ be the return time
to 0. Define for all $i \in E$

$$x_i = E_0\left[\sum_{n=1}^{T_0} 1_{\{X_n = i\}}\right] . \qquad (9.11)$$

(For $i \neq 0$, $x_i$ is the expected number of visits to state $i$ before
returning to 0.) Then, $0 < x < \infty$ and $x$ is an invariant measure of $P$.


Proof. We make three preliminary observations. First, it will be convenient to
rewrite (9.11) as

$$x_i = E_0\left[\sum_{n \ge 1} 1_{\{X_n = i\}}\, 1_{\{n \le T_0\}}\right] .$$

Next, when $1 \le n \le T_0$, $X_n = 0$ if and only if $n = T_0$. Therefore,

$$x_0 = 1 .$$

Also,

$$\sum_{i \in E} \sum_{n \ge 1} 1_{\{X_n = i\}}\, 1_{\{n \le T_0\}}
= \sum_{n \ge 1} \left(\sum_{i \in E} 1_{\{X_n = i\}}\right) 1_{\{n \le T_0\}}
= \sum_{n \ge 1} 1_{\{n \le T_0\}} = T_0 ,$$

and therefore

$$\sum_{i \in E} x_i = E_0[T_0] . \qquad (9.12)$$
We introduce the quantity

$${}_0p_{0i}(n) := E_0\left[1_{\{X_n = i\}}\, 1_{\{n \le T_0\}}\right] = P_0(X_1 \neq 0, \ldots, X_{n-1} \neq 0, X_n = i) .$$

This is the probability, starting from state 0, of visiting $i$ at time $n$ before
returning to 0. From the definition of $x$,

$$x_i = \sum_{n \ge 1} {}_0p_{0i}(n) . \qquad (\dagger)$$

We first prove (9.10). Observe that ${}_0p_{0i}(1) = p_{0i}$, and, by first-step
analysis, for all $n \ge 2$, ${}_0p_{0i}(n) = \sum_{j \neq 0} {}_0p_{0j}(n-1)\, p_{ji}$.
Summing up all the above equalities, and taking $(\dagger)$ into account, we obtain

$$x_i = p_{0i} + \sum_{j \neq 0} x_j\, p_{ji} ,$$

that is, (9.10), since $x_0 = 1$.

Next we show that $x_i > 0$ for all $i \in E$. Indeed, iterating (9.10), we find
$x^T = x^T P^n$, that is, since $x_0 = 1$,

$$x_i = \sum_{j \in E} x_j\, p_{ji}(n) = p_{0i}(n) + \sum_{j \neq 0} x_j\, p_{ji}(n) .$$

If $x_i$ were null for some $i \in E$, $i \neq 0$, the latter equality would imply
that $p_{0i}(n) = 0$ for all $n \ge 0$, which means that 0 and $i$ do not
communicate, in contradiction to the irreducibility assumption.

It remains to show that $x_i < \infty$ for all $i \in E$. As before, we find that

$$1 = x_0 = \sum_{j \in E} x_j\, p_{j0}(n)$$

for all $n \ge 1$, and therefore if $x_i = \infty$ for some $i$, necessarily
$p_{i0}(n) = 0$ for all $n \ge 1$, and this also contradicts irreducibility.
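
The construction of the canonical invariant measure can be checked numerically on a small chain. The following sketch is an illustrative addition: the 3-state matrix `P` is a hypothetical example, and the series of taboo probabilities ${}_0p_{0i}(n)$ is simply truncated after `n_terms` terms. It accumulates the taboo probabilities through the first-step recursion and then compares $x$ with $xP$.

```python
def canonical_invariant_measure(P, n_terms=2000):
    """Approximate x_i = sum over n >= 1 of 0p0i(n), truncating the series at n_terms."""
    m = len(P)
    taboo = list(P[0])                 # 0p0i(1) = p_{0i}
    x = list(taboo)
    for _ in range(n_terms - 1):
        # First-step recursion: 0p0i(n) = sum over j != 0 of 0p0j(n-1) * p_{ji}
        taboo = [sum(taboo[j] * P[j][i] for j in range(1, m)) for i in range(m)]
        x = [xi + ti for xi, ti in zip(x, taboo)]
    return x

# Hypothetical 3-state irreducible chain, chosen only for illustration.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
x = canonical_invariant_measure(P)
xP = [sum(x[j] * P[j][i] for j in range(3)) for i in range(3)]
print("x  =", x)
print("xP =", xP)
```

For this matrix one finds $x = (1, 2, 1)$, so that $\sum_i x_i = 4$, which is the mean return time to state 0, in agreement with (9.12).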


Theorem 9.2.9 The invariant measure of an irreducible recurrent HMC is unique 
up to a multiplicative factor. 


Proof. In the proof of Theorem 9.2.8, we showed that for an invariant measure $y$
of an irreducible chain, $y_i > 0$ for all $i \in E$, and therefore, one can define,
for all $i,j \in E$, the matrix $Q$ by

$$q_{ji} = \frac{y_i}{y_j}\, p_{ij} . \qquad (\star)$$

It is a transition matrix, since
$\sum_{i \in E} q_{ji} = \frac{1}{y_j} \sum_{i \in E} y_i\, p_{ij} = \frac{y_j}{y_j} = 1$.
The general term of $Q^n$ is

$$q_{ji}(n) = \frac{y_i}{y_j}\, p_{ij}(n) . \qquad (\star\star)$$

Indeed, supposing $(\star\star)$ true for $n$,

$$q_{ji}(n+1) = \sum_{k \in E} q_{jk}\, q_{ki}(n)
= \sum_{k \in E} \frac{y_k}{y_j}\, p_{kj}\, \frac{y_i}{y_k}\, p_{ik}(n)
= \frac{y_i}{y_j} \sum_{k \in E} p_{ik}(n)\, p_{kj}
= \frac{y_i}{y_j}\, p_{ij}(n+1) ,$$

and $(\star\star)$ follows by induction.

Clearly, $Q$ is irreducible, since $P$ is irreducible (just observe that
$q_{ji}(n) > 0$ if and only if $p_{ij}(n) > 0$ in view of $(\star\star)$). Also,
$p_{ii}(n) = q_{ii}(n)$, and therefore
$\sum_{n \ge 0} q_{ii}(n) = \sum_{n \ge 0} p_{ii}(n)$, and therefore $Q$ is recurrent
by the potential matrix criterion. Call $g_{ji}(n)$ the probability, relative to the
chain governed by the transition matrix $Q$, of returning to state $i$ for the first
time at step $n$ when starting from $j$. First-step analysis gives

$$g_{i0}(n+1) = \sum_{j \neq 0} q_{ij}\, g_{j0}(n) ,$$

that is, using $(\star)$,

$$y_i\, g_{i0}(n+1) = \sum_{j \neq 0} \left(y_j\, g_{j0}(n)\right) p_{ji} .$$

Recall that ${}_0p_{0i}(n+1) = \sum_{j \neq 0} {}_0p_{0j}(n)\, p_{ji}$, or, equivalently,

$$y_0\, {}_0p_{0i}(n+1) = \sum_{j \neq 0} \left(y_0\, {}_0p_{0j}(n)\right) p_{ji} .$$

We therefore see that the sequences $\{y_0\, {}_0p_{0i}(n)\}$ and
$\{y_i\, g_{i0}(n)\}$ satisfy the same recurrence equation. Their first terms
($n = 1$), respectively $y_0\, {}_0p_{0i}(1) = y_0\, p_{0i}$ and
$y_i\, g_{i0}(1) = y_i\, q_{i0}$, are equal in view of $(\star)$. Therefore, for all
$n \ge 1$,

$${}_0p_{0i}(n) = \frac{y_i}{y_0}\, g_{i0}(n) .$$

Summing up with respect to $n \ge 1$ and using $\sum_{n \ge 1} g_{i0}(n) = 1$ ($Q$ is
recurrent), we obtain that $x_i = \frac{y_i}{y_0}$.


Equality (9.12) and the definition of positive recurrence give the following. 


Theorem 9.2.10 An irreducible recurrent HMC is positive recurrent if and only
if its invariant measures $x$ satisfy

$$\sum_{i \in E} x_i < \infty .$$


The Stationary Distribution Criterion of Positive Recurrence 


An HMC may well be irreducible and possess an invariant measure, and yet not be
recurrent. The simplest example is the 1-D non-symmetric random walk, which
was shown to be transient and yet admits $x_i = 1$ ($i \in \mathbb{Z}$) for invariant
measure. However, it turns out that the existence of a stationary probability
distribution is necessary and sufficient for an irreducible chain (not a priori
assumed recurrent) to be positive recurrent.


Theorem 9.2.11 An irreducible HMC is positive recurrent if and only if there
exists a stationary distribution. Moreover, the stationary distribution $\pi$ is,
when it exists, unique, and $\pi > 0$.


Proof. The direct part follows from Theorems 9.2.8 and 9.2.10. For the converse
part, assume the existence of a stationary distribution $\pi$. Iterating
$\pi^T = \pi^T P$, we obtain $\pi^T = \pi^T P^n$, that is, for all $i \in E$,
$\pi(i) = \sum_{j \in E} \pi(j)\, p_{ji}(n)$. If the chain were transient, then, for
all states $i$, $j$,

$$\lim_{n \uparrow \infty} p_{ji}(n) = 0 .$$

The following is a formal proof:²

$$\begin{aligned}
\sum_{n \ge 1} p_{ji}(n) &= \sum_{n \ge 1} \sum_{k=1}^{n} P_j(T_i = k)\, p_{ii}(n-k) \\
&= \sum_{k \ge 1} P_j(T_i = k) \sum_{n \ge k} p_{ii}(n-k) \\
&\le \left(\sum_{k \ge 1} P_j(T_i = k)\right) \left(\sum_{n \ge 0} p_{ii}(n)\right) \\
&= P_j(T_i < \infty) \left(\sum_{n \ge 0} p_{ii}(n)\right) \le \sum_{n \ge 0} p_{ii}(n) < \infty .
\end{aligned}$$

In particular, $\lim_{n \uparrow \infty} p_{ji}(n) = 0$. Since $p_{ji}(n)$ is bounded
uniformly in $j$ and $n$ by 1, by the dominated convergence theorem for series:³

$$\pi(i) = \lim_{n \uparrow \infty} \sum_{j \in E} \pi(j)\, p_{ji}(n)
= \sum_{j \in E} \pi(j) \left(\lim_{n \uparrow \infty} p_{ji}(n)\right) = 0 .$$

This contradicts the assumption that $\pi$ is a stationary distribution
($\sum_{i \in E} \pi(i) = 1$). The chain must therefore be recurrent, and by
Theorem 9.2.10, it is positive recurrent.


The stationary distribution $\pi$ of an irreducible positive recurrent chain is unique
(use Theorem 9.2.9 and the fact that there is no choice for a multiplicative factor
but 1). Also recall that $\pi(i) > 0$ for all $i \in E$ (see Theorem 9.2.8).


Theorem 9.2.12 Let $\pi$ be the unique stationary distribution of an irreducible
positive recurrent HMC, and let $T_i$ be the return time to state $i$. Then

$$\pi(i)\, E_i[T_i] = 1 . \qquad (9.13)$$


Proof. This equality is a direct consequence of expression (9.11) for the invariant
measure. Indeed, $\pi$ is obtained by normalization of $x$: for all $i \in E$,

$$\pi(i) = \frac{x_i}{\sum_{j \in E} x_j} ,$$

and in particular, for $i = 0$, recalling that $x_0 = 1$ and using (9.12),

$$\pi(0) = \frac{1}{E_0[T_0]} .$$

Since state 0 does not play a special role in the analysis, (9.13) is true for all
$i \in E$.


² Rather awkward, but using only the elementary tools available.

³ Let $\{a_{nk}\}_{n \ge 1, k \ge 1}$ be an array of real numbers such that, for some
sequence $\{b_k\}_{k \ge 1}$ of non-negative numbers satisfying
$\sum_{k=1}^{\infty} b_k < \infty$, it holds that for all $n \ge 1$, $k \ge 1$,
$|a_{nk}| \le b_k$. If moreover for all $k \ge 1$,
$\lim_{n \uparrow \infty} a_{nk} = a_k$, then
$\lim_{n \uparrow \infty} \sum_{k=1}^{\infty} a_{nk} = \sum_{k=1}^{\infty} a_k$. (Note
that this result is a particular case of the dominated convergence theorem,
Theorem 5.1.3.)


The situation is extremely simple when the state space is finite.

Theorem 9.2.13 An irreducible HMC with finite state space is positive recurrent.

Proof. We first show recurrence. We have

$$\sum_{j \in E} p_{ij}(n) = 1 ,$$

and in particular, the limit of the left-hand side is 1. If the chain were transient,
then, as we saw in the proof of Theorem 9.2.11, for all $i,j \in E$,

$$\lim_{n \uparrow \infty} p_{ij}(n) = 0 ,$$

and therefore, since the state space is finite,

$$\lim_{n \uparrow \infty} \sum_{j \in E} p_{ij}(n) = 0 ,$$

a contradiction. Therefore, the chain is recurrent. By Theorem 9.2.8 it has an
invariant measure $x$. Since $E$ is finite, $\sum_{i \in E} x_i < \infty$, and therefore
the chain is positive recurrent, by Theorem 9.2.10.
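
For a finite irreducible chain, the relation (9.13) between the stationary distribution and the mean return times can be checked numerically. The sketch below is an illustrative addition under stated assumptions: the matrix `P` is a hypothetical aperiodic example (so that plain power iteration converges), and the mean return time is computed from the tail-sum formula $E_i[T_i] = \sum_{n \ge 0} P_i(T_i > n)$, truncated after `n_terms` terms.

```python
def stationary_distribution(P, n_iter=2000):
    """Power iteration pi <- pi P; assumes the finite chain is irreducible and aperiodic."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(n_iter):
        pi = [sum(pi[j] * P[j][i] for j in range(m)) for i in range(m)]
    return pi

def mean_return_time(P, i, n_terms=2000):
    """E_i[T_i] computed as the sum over n >= 0 of P_i(T_i > n), truncated at n_terms."""
    m = len(P)
    alive = [0.0] * m
    alive[i] = 1.0                     # at time 0 the chain sits at i
    total = 0.0
    for _ in range(n_terms):
        total += sum(alive)            # adds P_i(T_i > n)
        alive = [sum(alive[j] * P[j][k] for j in range(m)) for k in range(m)]
        alive[i] = 0.0                 # discard the paths that have just returned to i
    return total

# Hypothetical 3-state chain used only for illustration.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = stationary_distribution(P)
products = [pi[i] * mean_return_time(P, i) for i in range(3)]
print(pi, products)
```

Each product $\pi(i) E_i[T_i]$ should be equal to 1 up to truncation and rounding errors.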


EXAMPLE 9.2.14: THE REPAIR SHOP, TAKE 2. Recall that this Markov chain
satisfies the recurrence equation

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1} , \qquad (9.14)$$

where $a^+ = \max(a, 0)$. The sequence $\{Z_n\}_{n \ge 1}$ is assumed to be IID,
independent of the initial state $X_0$, and with common probability distribution

$$P(Z_1 = k) = a_k , \quad k \ge 0 ,$$

of generating function $g_Z$.


This chain is irreducible if and only if $P(Z_1 = 0) > 0$ and $P(Z_1 \ge 2) > 0$, as
we now prove formally. Looking at (9.14), we make the following observations. If
$P(Z_{n+1} = 0) = 0$, then $X_{n+1} \ge X_n$ a.s. and there is no way of going from
$i$ to $i-1$. If $P(Z_{n+1} \le 1) = 1$, then $X_{n+1} \le X_n$ whenever $X_n \ge 1$,
and there is no way of going from $i$ to $i+1$ when $i \ge 1$. Therefore, the two
conditions $P(Z_1 = 0) > 0$ and $P(Z_1 \ge 2) > 0$ are necessary for irreducibility.
They are also sufficient. Indeed if there exists an integer $k \ge 2$ such that
$P(Z_{n+1} = k) > 0$, then one can jump with positive probability from any $i > 0$
to $i+k-1 > i$ or from $i = 0$ to $k > 0$. Also if $P(Z_{n+1} = 0) > 0$, one can
step down from $i > 0$ to $i-1$ with positive probability. In particular, one can go
from $i$ to $j < i$ with positive probability. Therefore, one way to travel from $i$
to $j \ge i$ is by taking several successive steps of height at least $k-1$ in order
to reach a state $l \ge j$, and then (in the case of $l > j$) stepping down one
stair at a time from $l$ to $j$. All this with positive probability.


EXAMPLE 9.2.15: THE REPAIR SHOP, TAKE 3. Assuming irreducibility (see
Example 9.2.14), we now seek a necessary and sufficient condition for positive
recurrence. For any complex number $z$ with modulus not larger than 1, it follows
from the recurrence equation (9.14) that

$$z^{X_{n+1}+1} = \left(z^{(X_n-1)^+ + 1}\right) z^{Z_{n+1}}
= \left(z^{X_n} - 1_{\{X_n=0\}} + z\, 1_{\{X_n=0\}}\right) z^{Z_{n+1}} ,$$

and therefore $z\, z^{X_{n+1}} - z^{X_n} z^{Z_{n+1}} = (z-1)\, 1_{\{X_n=0\}}\, z^{Z_{n+1}}$.
From the independence of $X_n$ and $Z_{n+1}$,
$E[z^{X_n} z^{Z_{n+1}}] = E[z^{X_n}]\, g_Z(z)$, and
$E[1_{\{X_n=0\}} z^{Z_{n+1}}] = \pi(0)\, g_Z(z)$, where $\pi(0) = P(X_n = 0)$.
Therefore, $z\, E[z^{X_{n+1}}] - g_Z(z)\, E[z^{X_n}] = (z-1)\, \pi(0)\, g_Z(z)$.
But in steady state, $E[z^{X_{n+1}}] = E[z^{X_n}] = g_X(z)$, and therefore

$$g_X(z)\left(z - g_Z(z)\right) = \pi(0)\,(z-1)\, g_Z(z) . \qquad (9.15)$$

This gives the generating function $g_X(z) = \sum_{i=0}^{\infty} \pi(i) z^i$, as long
as $\pi(0)$ is available. To obtain $\pi(0)$, differentiate (9.15):

$$g_X'(z)\left(z - g_Z(z)\right) + g_X(z)\left(1 - g_Z'(z)\right) = \pi(0)\left(g_Z(z) + (z-1)\, g_Z'(z)\right) ,$$

and let $z = 1$, to obtain, taking into account the equalities
$g_X(1) = g_Z(1) = 1$ and $g_Z'(1) = E[Z]$,

$$\pi(0) = 1 - E[Z] . \qquad (9.16)$$

But the stationary distribution of an irreducible HMC is positive, hence the
necessary condition of positive recurrence:

$$E[Z_1] < 1 .$$

It turns out that this condition is also sufficient for positive recurrence.

From (9.15) and (9.16), we have the generating function of the stationary
distribution:

$$\sum_{i=0}^{\infty} \pi(i) z^i = \left(1 - E[Z]\right) \frac{(z-1)\, g_Z(z)}{z - g_Z(z)} . \qquad (9.17)$$

If $E[Z_1] > 1$, the chain is transient, as a simple argument based on the strong law
of large numbers shows. In fact,
$X_n = X_0 + \sum_{k=1}^{n} Z_k - n + \sum_{k=1}^{n} 1_{\{X_{k-1}=0\}}$, and therefore

$$X_n \ge \sum_{k=1}^{n} Z_k - n ,$$

which tends to $\infty$ because, by the strong law of large numbers,

$$\frac{\sum_{k=1}^{n} Z_k - n}{n} \to E[Z] - 1 > 0 .$$

This is of course incompatible with recurrence.

In the case $E[Z_1] = 1$, there are only two possibilities left: transient or null
recurrent. It turns out that the chain is null recurrent in this case.
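
When $E[Z_1] < 1$, the coefficients $\pi(i)$ can be extracted from (9.15) by identifying the coefficients of $z^m$ on both sides, which gives $a_0\, \pi(m) = \pi(m-1) - \sum_{i=0}^{m-1} \pi(i)\, a_{m-i} - \pi(0)(a_{m-1} - a_m)$ for $m \ge 1$. The sketch below is an illustrative addition: this recursion is our own rearrangement of (9.15), and the law `a` of $Z_1$ is a hypothetical example with finite support.

```python
from fractions import Fraction

def repair_shop_stationary(a, n_states):
    """First n_states values of pi, from coefficient matching in (9.15) with pi(0) from (9.16).

    a[k] = P(Z = k); the support is assumed finite, with a[0] > 0 and E[Z] < 1.
    """
    a = [Fraction(v) for v in a]
    mean_z = sum(k * ak for k, ak in enumerate(a))
    if a[0] <= 0 or mean_z >= 1:
        raise ValueError("need P(Z = 0) > 0 and E[Z] < 1")

    def coef(k):
        return a[k] if 0 <= k < len(a) else Fraction(0)

    pi = [1 - mean_z]                  # (9.16)
    for m in range(1, n_states):
        rhs = (pi[m - 1]
               - sum(pi[i] * coef(m - i) for i in range(m))
               - pi[0] * (coef(m - 1) - coef(m)))
        pi.append(rhs / a[0])
    return pi

# Hypothetical law of Z: P(Z=0)=1/2, P(Z=1)=1/5, P(Z=2)=3/10, so that E[Z] = 4/5.
a = [Fraction(1, 2), Fraction(1, 5), Fraction(3, 10)]
pi = repair_shop_stationary(a, 60)
print([str(v) for v in pi[:4]], float(sum(pi)))
```

The values obtained can be cross-checked against the global balance equations $\pi(i) = \pi(0)\, a_i + \sum_{j=1}^{i+1} \pi(j)\, a_{i-j+1}$ of the chain (9.14).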


EXAMPLE 9.2.16: THE PURE RANDOM WALK ON A GRAPH. Consider a
finite non-directed connected graph $G = (V, \mathcal{E})$ where $V$ is the set of
vertices, or nodes, and $\mathcal{E}$ is the set of edges. Let $d_i$ be the index of
vertex $i$ (the number of edges "adjacent" to vertex $i$). Since there are no
isolated nodes (a consequence of the connectedness assumption), $d_i > 0$ for all
$i \in V$. Transform this graph into a directed graph by splitting each edge into two
directed edges of opposite directions, and make it a transition graph by associating
to the directed edge from $i$ to $j$ the transition probability $\frac{1}{d_i}$ (see
the figure below). Note that $\sum_{i \in V} d_i = 2|\mathcal{E}|$.


[Figure: A random walk on a graph]


The corresponding HMC with state space $E = V$ is irreducible ($G$ is connected).
It therefore admits a unique stationary distribution $\pi$, which we attempt to find
via Theorem 9.1.24. Let $i$ and $j$ be connected by an edge, and therefore
$p_{ij} = \frac{1}{d_i}$ and $p_{ji} = \frac{1}{d_j}$, so that the detailed balance
equation between these two states is

$$\pi(i)\, \frac{1}{d_i} = \pi(j)\, \frac{1}{d_j} .$$

This gives $\pi(i) = K d_i$, where $K$ is obtained by normalization:
$K = \left(\sum_{j \in V} d_j\right)^{-1} = (2|\mathcal{E}|)^{-1}$. Therefore

$$\pi(i) = \frac{d_i}{2|\mathcal{E}|} .$$

The lazy random walk on the graph is, by definition, the Markov chain on $V$
with the transition probabilities $p_{ii} = \frac12$ and, for $i,j \in V$ such that
$i$ and $j$ are connected by an edge of the graph, $p_{ij} = \frac{1}{2d_i}$. This
modified chain admits the same stationary distribution as the original random walk.
The difference is that the lazy version is always aperiodic, whereas the original
version may be periodic.
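
The degree formula for the stationary distribution is easy to verify on a concrete graph. The following sketch is an illustrative addition; the adjacency dictionary `adj` describes a hypothetical 4-vertex graph without self-loops, and the helper names are ours. It checks with exact arithmetic that $\pi(i) = d_i/(2|\mathcal{E}|)$ is stationary for both the pure and the lazy walk.

```python
from fractions import Fraction

def walk_matrix(adj, lazy=False):
    """Transition probabilities of the pure (or lazy) random walk on an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops assumed).
    """
    P = {}
    for i in adj:
        for j in adj:
            pij = Fraction(1, len(adj[i])) if j in adj[i] else Fraction(0)
            if lazy:
                # Stay put with probability 1/2, otherwise move as the pure walk.
                pij = pij / 2 + (Fraction(1, 2) if i == j else Fraction(0))
            P[i, j] = pij
    return P

def degree_distribution(adj):
    """pi(i) = d_i / (2|E|), using the fact that the degrees sum to 2|E|."""
    twice_edges = sum(len(nb) for nb in adj.values())
    return {i: Fraction(len(nb), twice_edges) for i, nb in adj.items()}

def is_stationary(pi, P):
    """Check the global balance equations pi = pi P exactly."""
    return all(sum(pi[j] * P[j, i] for j in pi) == pi[i] for i in pi)

# Hypothetical connected graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
pi = degree_distribution(adj)
print(pi, is_stationary(pi, walk_matrix(adj)), is_stationary(pi, walk_matrix(adj, lazy=True)))
```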


The stationary distribution criterion can also be used to prove instability. 


Birth-and-Death Markov Chain 


We first define the birth-and-death process with a bounded population. The state
space of such a chain is $E = \{0, 1, \ldots, N\}$ and its transition matrix is

$$P = \begin{pmatrix}
r_0 & p_0 & & & & & \\
q_1 & r_1 & p_1 & & & & \\
 & q_2 & r_2 & p_2 & & & \\
 & & \ddots & \ddots & \ddots & & \\
 & & & q_i & r_i & p_i & \\
 & & & & \ddots & \ddots & \ddots \\
 & & & & q_{N-1} & r_{N-1} & p_{N-1} \\
 & & & & & q_N & r_N
\end{pmatrix} ,$$

where $p_i > 0$ for all $i \in E \setminus \{N\}$, $q_i > 0$ for all
$i \in E \setminus \{0\}$, $r_i \ge 0$ for all $i \in E$, and $p_i + q_i + r_i = 1$
for all $i \in E$. The positivity conditions placed on the $p_i$'s and $q_i$'s
guarantee that the chain is irreducible. Since the state space is finite, it is
positive recurrent (Theorem 9.2.13), and it has a unique stationary distribution.
Motivated by the Ehrenfest HMC, which is reversible in the stationary state, we
make the educated guess that the birth-and-death process considered has the same
property. This will be the case if and only if there exists a probability
distribution $\pi$ on $E$ satisfying the detailed balance equations, that is, such
that for all $1 \le i \le N$, $\pi(i-1)\, p_{i-1} = \pi(i)\, q_i$. Letting $w_0 = 1$
and for all $1 \le i \le N$,

$$w_i = \prod_{k=1}^{i} \frac{p_{k-1}}{q_k} ,$$

we find that

$$\pi(i) = \frac{w_i}{\sum_{j=0}^{N} w_j} \qquad (9.18)$$

indeed satisfies the detailed balance equations and is therefore the (unique)
stationary distribution of the chain.
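
The product formula for the weights $w_i$ translates directly into code. The sketch below is an illustrative addition with hypothetical parameter values; note the indexing convention, which is our own: `p[i]` stands for $p_i$ and `q[i]` stands for $q_{i+1}$.

```python
from fractions import Fraction

def birth_death_stationary(p, q):
    """Stationary distribution built from the weights w_i = prod_{k=1..i} p_{k-1} / q_k.

    p[i] is the up-probability from state i (i = 0..N-1);
    q[i] is the down-probability from state i+1 (so q[i] stands for q_{i+1}).
    """
    w = [Fraction(1)]                  # w_0 = 1
    for up, down in zip(p, q):
        w.append(w[-1] * Fraction(up) / Fraction(down))
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical chain on {0, 1, 2}: p_0 = 1/2, p_1 = 1/4, q_1 = 1/4, q_2 = 1/2.
p = [Fraction(1, 2), Fraction(1, 4)]
q = [Fraction(1, 4), Fraction(1, 2)]
pi = birth_death_stationary(p, q)
print([str(v) for v in pi])
```

Detailed balance, $\pi(i)\, p_i = \pi(i+1)\, q_{i+1}$, can then be checked entry by entry.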


We now consider the unbounded birth-and-death process. This chain has the
state space $E = \mathbb{N}$ and its transition matrix is as in the previous example
(only, it is unbounded on the right). In particular, we assume that the $p_i$'s and
$q_i$'s are positive in order to guarantee irreducibility. The same reversibility
argument as above applies with a little difference. In fact we can show that the
$w_i$'s defined above satisfy the detailed balance equations and therefore the
global balance equations. Therefore the vector $\{w_i\}_{i \in E}$ is the unique, up
to a multiplicative factor, invariant measure of the chain. It can be normalized to
a probability distribution if and only if

$$\sum_{j=0}^{\infty} w_j < \infty .$$

Therefore, in this case and in this case only there exists a (unique) stationary
distribution, also given by (9.18).


Note that the stationary distribution, when it exists, does not depend on the
$r_i$'s. The recurrence properties of the above unbounded birth-and-death process
are therefore the same as those of the chain below, which is however not aperiodic.
For aperiodicity, it suffices to suppose at least one of the $r_i$'s is positive.


[Figure: transition graph on $\mathbb{N}$ with upward probabilities $p_0 = 1$, $p_1$, $p_2$, ..., $p_{i-1}$, $p_i$ and downward probabilities $q_1$, $q_2$, $q_3$, ..., $q_i$, $q_{i+1}$]

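We now compute, for the (bounded or unbounded) irreducible birth-and-death
process, the average time it takes to reach a state $b$ from a state $a < b$. In
fact, we shall prove that

$$E_a[T_b] = \sum_{k=a+1}^{b} \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j . \qquad (9.19)$$

Since $E_a[T_b] = \sum_{k=a+1}^{b} E_{k-1}[T_k]$, it suffices to prove that

$$E_{k-1}[T_k] = \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j . \qquad (\star)$$

For this, consider for any given $k \in \{0, 1, \ldots, N\}$ the truncated chain,
which moves on the state space $\{0, 1, \ldots, k\}$ as the original chain, except in
state $k$ where it moves one step down with probability $q_k$ and stays still with
probability $p_k + r_k$. Write $\tilde{E}$ for expectations of the modified chain.
The unique stationary distribution of this chain is given by

$$\tilde{\pi}_\ell = \frac{w_\ell}{\sum_{j=0}^{k} w_j}$$

for all $0 \le \ell \le k$. First-step analysis shows that
$\tilde{E}_k[T_k] = (r_k + p_k) \times 1 + q_k\left(1 + \tilde{E}_{k-1}[T_k]\right)$,
that is

$$\tilde{E}_k[T_k] = 1 + q_k\, \tilde{E}_{k-1}[T_k] .$$

Also

$$\tilde{E}_k[T_k] = \frac{1}{\tilde{\pi}_k} = \frac{1}{w_k} \sum_{j=0}^{k} w_j ,$$

and therefore, since $\tilde{E}_{k-1}[T_k] = E_{k-1}[T_k]$, we have $(\star)$.

In the special case where $(p_j, q_j, r_j) = (p, q, r)$ for all $j \neq 0, N$,
$(p_0, q_0, r_0) = (p, q+r, 0)$ and $(p_N, q_N, r_N) = (0, p+r, q)$, we have
$w_i = \left(\frac{p}{q}\right)^i$, and for $1 \le k \le N$,

$$E_{k-1}[T_k] = \frac{1}{q \left(\frac{p}{q}\right)^k} \sum_{j=0}^{k-1} \left(\frac{p}{q}\right)^j
= \frac{1}{p-q} \left(1 - \left(\frac{q}{p}\right)^k\right) .$$

In the further particularization where $p = q$, $w_i = 1$ for all $i$ and

$$E_{k-1}[T_k] = \frac{k}{p} .$$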
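
Formula (9.19) can be cross-checked against a direct first-step computation. Writing $d_k = E_{k-1}[T_k]$, first-step analysis from state $k$ gives $d_{k+1} = 1 + r_k\, d_{k+1} + q_k (d_k + d_{k+1})$, that is, $d_{k+1} = (1 + q_k d_k)/p_k$, with $d_1 = 1/p_0$. The sketch below is an illustrative addition; the recursion is our own rearrangement, the parameter values are hypothetical, and `p[i]`, `q[i]` are indexed by the state $i$ with `q[0] = 0`.

```python
from fractions import Fraction

def passage_time_formula(p, q, a, b):
    """E_a[T_b] from (9.19): sum over k = a+1..b of (w_0 + ... + w_{k-1}) / (q_k w_k)."""
    w = [Fraction(1)]
    for k in range(1, b + 1):
        w.append(w[-1] * p[k - 1] / q[k])
    return sum(sum(w[:k]) / (q[k] * w[k]) for k in range(a + 1, b + 1))

def passage_time_first_step(p, q, a, b):
    """E_a[T_b] from the recursion d_1 = 1/p_0, d_{k+1} = (1 + q_k d_k) / p_k."""
    d = Fraction(1) / p[0]             # d_1 = E_0[T_1]
    total = d if a == 0 else Fraction(0)
    for k in range(1, b):
        d = (1 + q[k] * d) / p[k]      # d_{k+1} = E_k[T_{k+1}]
        if k >= a:
            total += d
    return total

# Hypothetical chain on {0, 1, 2, 3}.
F = Fraction
p = [F(1, 2), F(1, 4), F(1, 4), F(0)]
q = [F(0), F(1, 4), F(1, 2), F(1, 2)]
for a in range(3):
    for b in range(a + 1, 4):
        print(a, b, passage_time_formula(p, q, a, b), passage_time_first_step(p, q, a, b))
```

In the symmetric case $p_i = q_i = p$, both functions return $\sum_{k=a+1}^{b} k/p$, in agreement with the last display above.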
Foster’s Theorem


The stationary distribution criterion of positive recurrence of an irreducible chain 
requires solving the balance equation, an often hopeless enterprise. The following 
sufficient condition is more tractable and indeed quite powerful. 


Theorem 9.2.17 Let $P$ be an irreducible transition matrix on the countable state
space $E$. Suppose that there exists a function $h : E \to \mathbb{R}$ such that
$\inf_i h(i) > -\infty$,

$$\sum_{k \in E} p_{ik}\, h(k) < \infty \quad (i \in F), \qquad (9.20)$$

and

$$\sum_{k \in E} p_{ik}\, h(k) \le h(i) - \varepsilon \quad (i \notin F), \qquad (9.21)$$

for some finite set $F$ and some $\varepsilon > 0$. Then the corresponding HMC is
positive recurrent.


Proof. Recall the notation $X_0^n$ for $(X_0, \ldots, X_n)$. Since
$\inf_i h(i) > -\infty$, one may assume without loss of generality that $h \ge 0$,
by adding a constant if necessary. Call $\tau$ the return time to $F$ and let
$Y_n := h(X_n)\, 1_{\{n < \tau\}}$. Inequality (9.21) implies that
$E[h(X_{n+1}) \mid X_n = j] \le h(j) - \varepsilon$ for all $j \notin F$. For
$i \notin F$,

$$\begin{aligned}
E_i[Y_{n+1} \mid X_0^n] &= E_i[Y_{n+1} 1_{\{n < \tau\}} \mid X_0^n] + E_i[Y_{n+1} 1_{\{n \ge \tau\}} \mid X_0^n] \\
&= E_i[Y_{n+1} 1_{\{n < \tau\}} \mid X_0^n] \le E_i[h(X_{n+1}) 1_{\{n < \tau\}} \mid X_0^n] \\
&= 1_{\{n < \tau\}}\, E_i[h(X_{n+1}) \mid X_0^n] = 1_{\{n < \tau\}}\, E_i[h(X_{n+1}) \mid X_n] \\
&\le 1_{\{n < \tau\}}\, h(X_n) - \varepsilon\, 1_{\{n < \tau\}} ,
\end{aligned}$$

where the third equality comes from the fact that $1_{\{n < \tau\}}$ is a function of
$X_0^n$ (Theorem 2.4.6), the fourth equality is the Markov property and the last
inequality is true because $P_i$-a.s., $X_n \notin F$ on $n < \tau$. Therefore,
$P_i$-a.s.,

$$E_i[Y_{n+1} \mid X_0^n] \le Y_n - \varepsilon\, 1_{\{n < \tau\}} ,$$

and, taking expectations,

$$0 \le E_i[Y_{n+1}] \le E_i[Y_n] - \varepsilon\, P_i(\tau > n) .$$

Iterating the above inequality and taking into account the fact that $Y_n$ is
non-negative, we obtain

$$0 \le E_i[Y_0] - \varepsilon \sum_{k=0}^{n} P_i(\tau > k) .$$

But $Y_0 = h(i)$, $P_i$-a.s., and $\sum_{k=0}^{\infty} P_i(\tau > k) = E_i[\tau]$.
Therefore, for all $i \notin F$,

$$E_i[\tau] \le \varepsilon^{-1} h(i) .$$

For $j \in F$, by first-step analysis

$$E_j[\tau] = 1 + \sum_{i \notin F} p_{ji}\, E_i[\tau] .$$

Therefore $E_j[\tau] \le 1 + \varepsilon^{-1} \sum_{i \notin F} p_{ji}\, h(i)$, a finite
quantity in view of assumption (9.20): the return time to $F$ starting anywhere in
$F$ has finite expectation. Since $F$ is a finite set, this implies positive
recurrence in view of the following lemma.


Lemma 9.2.18 Let $\{X_n\}_{n \ge 0}$ be an irreducible HMC, let $F$ be a finite
subset of the state space $E$ and let $\tau(F)$ be the return time to $F$. If
$E_j[\tau(F)] < \infty$ for all $j \in F$, the chain is positive recurrent.


Proof. Exercise 9.7.15.


The function $h$ in Foster's theorem is called a Lyapunov function because it
plays a role similar to the Lyapunov functions in the stability theory of ordinary
differential equations. It has a tendency to decrease along the trajectories of the
process, at least outside a finite set of states, called the refuge. Since it is
non-negative, it cannot decrease forever and therefore the process eventually enters
the refuge.


The following corollary of Foster's theorem is sometimes referred to as Pakes'
lemma.


Corollary 9.2.19 Let $\{X_n\}_{n \ge 0}$ be an irreducible HMC on $E = \mathbb{N}$
such that for all $n \ge 0$ and all $i \in E$,

$$E[X_{n+1} \mid X_n = i] < \infty \qquad (9.22)$$

and

$$\limsup_{i \uparrow \infty} E[X_{n+1} - X_n \mid X_n = i] < 0 . \qquad (9.23)$$

Such an HMC is positive recurrent.


Proof. Let $-2\varepsilon$ be the left-hand side of (9.23). In particular,
$\varepsilon > 0$. By (9.23), for $i$ sufficiently large, say $i > i_0$,
$E[X_{n+1} - X_n \mid X_n = i] < -\varepsilon$, and therefore the conditions of
Foster's theorem are satisfied with $h(i) = i$ and $F = \{i \,;\, i \le i_0\}$.


EXAMPLE 9.2.20: A RANDOM WALK ON $\mathbb{N}$. Let $\{Z_n\}_{n \ge 1}$ be an IID
sequence of integrable random variables with values in $\mathbb{Z}$ such that

$$E[Z_1] < 0 ,$$

and define $\{X_n\}_{n \ge 0}$, an HMC with state space $E = \mathbb{N}$, by

$$X_{n+1} = (X_n + Z_{n+1})^+ ,$$

where $X_0$ is independent of $\{Z_n\}_{n \ge 1}$. Assume irreducibility (the reader
is invited to find a necessary and sufficient condition for this). Here

$$\begin{aligned}
E[X_{n+1} - i \mid X_n = i] &= E[(i + Z_{n+1})^+ - i] \\
&= E\left[-i\, 1_{\{Z_{n+1} \le -i\}} + Z_{n+1} 1_{\{Z_{n+1} > -i\}}\right] \le E\left[Z_1 1_{\{Z_1 > -i\}}\right] .
\end{aligned}$$

By dominated convergence, the limit of $E[Z_1 1_{\{Z_1 > -i\}}]$ as $i$ tends to
$\infty$ is $E[Z_1] < 0$ and therefore, by Pakes' lemma, the HMC is positive
recurrent.
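
The drift appearing in Pakes' lemma is straightforward to evaluate when $Z_1$ has finite support. The sketch below is an illustrative addition: the law of $Z_1$ is a hypothetical example with $E[Z_1] = -1/2$, and `refuge_bound` returns one admissible value of $i_0$ (not necessarily the smallest), namely one beyond which the reflection at 0 no longer plays a role.

```python
from fractions import Fraction

def drift(i, law):
    """E[(i + Z)^+ - i] when Z has the finite law {value: probability}."""
    return sum(prob * (max(i + z, 0) - i) for z, prob in law.items())

def refuge_bound(law):
    """An i0 such that the drift equals E[Z] for every i > i0 (finite support assumed)."""
    return max(0, -min(law) - 1)

# Hypothetical law: Z = -2 or +1 with probability 1/2 each, so that E[Z] = -1/2.
law = {-2: Fraction(1, 2), 1: Fraction(1, 2)}
mean_z = sum(z * prob for z, prob in law.items())
i0 = refuge_bound(law)
print("E[Z] =", mean_z, " i0 =", i0)
print([str(drift(i, law)) for i in range(6)])
```

With $h(i) = i$, $\varepsilon = 1/4$ and $F = \{0, 1\}$, the drift condition of Foster's theorem holds outside $F$ for this example.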


EXAMPLE 9.2.21: THE REPAIR SHOP, TAKE 4. Continuation of Example 
9.2.15. Arguments very similar to those of the previous example show that in 
the repair shop HMC (assumed irreducible), condition E[Z] < 1 implies positive 
recurrence. 


9.3. Long-run Behavior 


The Markov Chain Ergodic Theorem 


The ergodic theorem for Markov chains gives conditions guaranteeing that
empirical averages of the type

$$\frac{1}{N} \sum_{k=1}^{N} f(X_k, \ldots, X_{k+L})$$

converge to the corresponding probabilistic averages. This result is an almost
immediate application of the strong law of large numbers.


Proposition 9.3.1 Let $\{X_n\}_{n \ge 0}$ be an irreducible recurrent HMC and let $x$
denote the canonical invariant measure associated with state $0 \in E$, which is
given by (9.11). Define for $n \ge 1$

$$\nu(n) = \sum_{k=1}^{n} 1_{\{X_k = 0\}} . \qquad (9.24)$$

Let $f : E \to \mathbb{R}$ be such that

$$\sum_{i \in E} |f(i)|\, x_i < \infty . \qquad (9.25)$$

Then, for any initial distribution $\mu$, $P_\mu$-a.s.,

$$\lim_{N \uparrow \infty} \frac{1}{\nu(N)} \sum_{k=1}^{N} f(X_k) = \sum_{i \in E} f(i)\, x_i . \qquad (9.26)$$


Before the proof, we shall harvest the most interesting consequences.

Before the proof, we shall harvest the most interesting consequences. 


Theorem 9.3.2 Let $\{X_n\}_{n \ge 0}$ be an irreducible positive recurrent Markov
chain with the stationary distribution $\pi$, and let $f : E \to \mathbb{R}$ be such
that

$$\sum_{i \in E} |f(i)|\, \pi(i) < \infty . \qquad (9.27)$$

Then for any initial distribution $\mu$, $P_\mu$-a.s.,

$$\lim_{N \uparrow \infty} \frac{1}{N} \sum_{k=1}^{N} f(X_k) = \sum_{i \in E} f(i)\, \pi(i) . \qquad (9.28)$$


Proof. Apply Proposition 9.3.1 to $f \equiv 1$. Condition (9.25) is satisfied, since
in the positive recurrent case, $\sum_{i \in E} x_i < \infty$. Therefore,
$P_\mu$-a.s.,

$$\lim_{N \uparrow \infty} \frac{N}{\nu(N)} = \sum_{j \in E} x_j .$$

Now, $f$ satisfying (9.27) also satisfies (9.25), since $x$ and $\pi$ are
proportional, and therefore, $P_\mu$-a.s.,

$$\lim_{N \uparrow \infty} \frac{1}{\nu(N)} \sum_{k=1}^{N} f(X_k) = \sum_{i \in E} f(i)\, x_i .$$

Combining the above equalities gives, $P_\mu$-a.s.,

$$\lim_{N \uparrow \infty} \frac{1}{N} \sum_{k=1}^{N} f(X_k)
= \lim_{N \uparrow \infty} \frac{\nu(N)}{N}\, \frac{1}{\nu(N)} \sum_{k=1}^{N} f(X_k)
= \frac{\sum_{i \in E} f(i)\, x_i}{\sum_{j \in E} x_j} ,$$

from which (9.28) follows, since $\pi$ is obtained by normalization of $x$.


Corollary 9.3.3 Let $\{X_n\}_{n \ge 0}$ be an irreducible positive recurrent Markov
chain with the stationary distribution $\pi$, and let
$g : E^{L+1} \to \mathbb{R}$ be such that

$$\sum_{i_0, \ldots, i_L} |g(i_0, \ldots, i_L)|\, \pi(i_0)\, p_{i_0 i_1} \cdots p_{i_{L-1} i_L} < \infty .$$

Then for all initial distributions $\mu$, $P_\mu$-a.s.

$$\lim_{N \uparrow \infty} \frac{1}{N} \sum_{k=1}^{N} g(X_k, X_{k+1}, \ldots, X_{k+L})
= \sum_{i_0, i_1, \ldots, i_L} g(i_0, i_1, \ldots, i_L)\, \pi(i_0)\, p_{i_0 i_1} \cdots p_{i_{L-1} i_L} .$$


Proof. Apply Theorem 9.3.2 to the "snake chain"
$\{(X_n, X_{n+1}, \ldots, X_{n+L})\}_{n \ge 0}$, which is (see Exercise 9.7.13)
irreducible recurrent and admits the stationary distribution

$$\pi(i_0)\, p_{i_0 i_1} \cdots p_{i_{L-1} i_L} .$$

Note that

$$\sum_{i_0, i_1, \ldots, i_L} g(i_0, i_1, \ldots, i_L)\, \pi(i_0)\, p_{i_0 i_1} \cdots p_{i_{L-1} i_L} = E_\pi[g(X_0, \ldots, X_L)] .$$
Proof. (of Proposition 9.3.1.) Let $T_0 = \tau_1, \tau_2, \tau_3, \ldots$ be the
successive return times to state 0, and define

$$U_p = \sum_{n=\tau_p+1}^{\tau_{p+1}} f(X_n) .$$

In view of the regenerative cycle theorem, $\{U_p\}_{p \ge 1}$ is an IID sequence.
Moreover, assuming $f \ge 0$ and using the strong Markov property,

$$\begin{aligned}
E[U_1] &= E_0\left[\sum_{n=1}^{T_0} f(X_n)\right]
= E_0\left[\sum_{n=1}^{T_0} \sum_{i \in E} f(i)\, 1_{\{X_n = i\}}\right] \\
&= \sum_{i \in E} f(i)\, E_0\left[\sum_{n=1}^{T_0} 1_{\{X_n = i\}}\right]
= \sum_{i \in E} f(i)\, x_i .
\end{aligned}$$

This quantity is finite by hypothesis and therefore the strong law of large numbers
applies to give

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{p=1}^{n} U_p = \sum_{i \in E} f(i)\, x_i ,$$

that is,

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=T_0+1}^{\tau_{n+1}} f(X_k) = \sum_{i \in E} f(i)\, x_i . \qquad (9.29)$$

Observing that

$$\tau_{\nu(n)} \le n < \tau_{\nu(n)+1} ,$$

we have

$$\frac{\sum_{k=1}^{\tau_{\nu(n)}} f(X_k)}{\nu(n)} \le \frac{\sum_{k=1}^{n} f(X_k)}{\nu(n)} \le \frac{\sum_{k=1}^{\tau_{\nu(n)+1}} f(X_k)}{\nu(n)} .$$

Since the chain is recurrent, $\lim_{n \uparrow \infty} \nu(n) = \infty$, and
therefore, from (9.29), the extreme terms of the above chain of inequalities tend to
$\sum_{i \in E} f(i)\, x_i$ as $n$ goes to $\infty$, and this implies (9.26). The case
of a function $f$ of arbitrary sign is obtained by considering (9.26) written
separately for $f^+ = \max(0, f)$ and $f^- = \max(0, -f)$, and then taking the
difference of the two equalities obtained in this way. The difference is not an
undetermined form $\infty - \infty$ due to hypothesis (9.25).


The version of the ergodic theorem for Markov chains featured in Theorem
9.3.2 is a kind of strong law of large numbers, and it can be used in simulations to
compute, when $\pi$ is unknown, quantities of the type $E_\pi[f(X_0)]$.
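
As a small illustration of this use in simulation (an addition under stated assumptions), the sketch below simulates a hypothetical two-state chain whose stationary distribution, $(5/6, 1/6)$, is known in closed form, and compares the empirical average of $f(X_k)$ with $\sum_i f(i)\pi(i)$. The seed, the chain and the function `f` are arbitrary choices, and the agreement is only approximate for finite $N$.

```python
import random

def ergodic_average(P, f, n_steps, start=0, seed=0):
    """Empirical average (1/N) * sum_{k=1..N} f(X_k) along one simulated trajectory."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    states = range(len(P))
    x, total = start, 0.0
    for _ in range(n_steps):
        # Draw the next state according to row x of the transition matrix.
        x = rng.choices(states, weights=P[x])[0]
        total += f(x)
    return total / n_steps

# Hypothetical two-state chain; solving pi = pi P gives pi = (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
estimate = ergodic_average(P, f=lambda i: i, n_steps=200_000)
print(estimate, 1 / 6)
```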


The Markov Chain Convergence Theorem 


This is one of the fundamental theoretical results of Markov chain theory. The 
proof will be given in terms of convergence in variation and is based on coupling. 


Definition 9.3.4 (A) A sequence $\{\alpha_n\}_{n \ge 0}$ of probability distributions
on $E$ is said to converge in variation to the probability distribution $\beta$ on
$E$ if

$$\lim_{n \uparrow \infty} d_V(\alpha_n, \beta) = 0 .$$

(B) An $E$-valued random sequence $\{X_n\}_{n \ge 0}$ such that for some probability
distribution $\pi$ on $E$,

$$\lim_{n \uparrow \infty} d_V(X_n, \pi) = 0 , \qquad (9.30)$$

is said to converge in variation to $\pi$.


Observe that Definition 9.3.4 concerns only the marginal distributions of the
stochastic process, not the stochastic process itself. Therefore, if there exists
another stochastic process $\{X'_n\}_{n \ge 0}$ such that
$X_n \stackrel{\mathcal{D}}{\sim} X'_n$ for all $n \ge 0$, and if there exists a third
one $\{X''_n\}_{n \ge 0}$ such that $X''_n \stackrel{\mathcal{D}}{\sim} \pi$ for all
$n \ge 0$, then (9.30) follows from

$$\lim_{n \uparrow \infty} d_V(X'_n, X''_n) = 0 . \qquad (9.31)$$

This trivial observation is useful because of the resulting freedom in the choice of
$\{X'_n\}$ and $\{X''_n\}$. An interesting situation occurs when there exists a finite
random time $\tau$ such that $X'_n = X''_n$ for all $n \ge \tau$.


Definition 9.3.5 Two stochastic processes $\{X'_n\}_{n \ge 0}$ and
$\{X''_n\}_{n \ge 0}$ taking their values in the same state space $E$ are said to
couple if there exists an almost surely finite random time $\tau$ such that

$$n \ge \tau \Rightarrow X'_n = X''_n . \qquad (9.32)$$

The random variable $\tau$ is called a coupling time of the two processes.


Theorem 9.3.6 For any coupling time $\tau$ of $\{X'_n\}_{n \ge 0}$ and
$\{X''_n\}_{n \ge 0}$, we have the coupling inequality

$$d_V(X'_n, X''_n) \le P(\tau > n) . \qquad (9.33)$$


Proof. For all $A \subseteq E$,

$$\begin{aligned}
P(X'_n \in A) - P(X''_n \in A) &= P(X'_n \in A, \tau \le n) + P(X'_n \in A, \tau > n) \\
&\quad - P(X''_n \in A, \tau \le n) - P(X''_n \in A, \tau > n) \\
&= P(X'_n \in A, \tau > n) - P(X''_n \in A, \tau > n) \\
&\le P(X'_n \in A, \tau > n) \le P(\tau > n) .
\end{aligned}$$

Inequality (9.33) then follows from Lemma 7.3.2.

Therefore, if the coupling time is $P$-a.s. finite, that is
$\lim_{n \uparrow \infty} P(\tau > n) = 0$,

$$\lim_{n \uparrow \infty} d_V(X_n, \pi) = \lim_{n \uparrow \infty} d_V(X'_n, X''_n) = 0 .$$

ntoo 


Consider an HMC that is irreducible and positive recurrent. If its initial distri- 
bution is the stationary distribution, it keeps the same distribution at all times. 
The chain is then said to be in the stationary regime, or in equilibrium, or in steady 
state. 


A question arises naturally: What is the long-run behavior of the chain when
the initial distribution $\mu$ is arbitrary? For instance, will it converge to
equilibrium? In what sense?


The classical form of the result is that for arbitrary states $i$ and $j$,

$$\lim_{n \uparrow \infty} p_{ij}(n) = \pi(j) , \qquad (9.34)$$

if the chain is ergodic, according to the following definition:


Definition 9.3.7 An irreducible positive recurrent and aperiodic HMC is called
ergodic.


In fact, (9.34) can be drastically improved:


Theorem 9.3.8 Let $\{X_n\}_{n \ge 0}$ be an ergodic HMC on the countable state space
$E$ with transition matrix $P$ and stationary distribution $\pi$, and let $\mu$ be an
arbitrary initial distribution. Then

$$\lim_{n \uparrow \infty} \sum_{i \in E} |P_\mu(X_n = i) - \pi(i)| = 0 ,$$

and in particular, for all $j \in E$,

$$\lim_{n \uparrow \infty} \sum_{i \in E} |p_{ji}(n) - \pi(i)| = 0 .$$

In fact, for all probability distributions $\mu$ and $\nu$ on $E$,

$$\lim_{n \uparrow \infty} d_V(\mu^T P^n, \nu^T P^n) = 0 .$$


Proof. (The first two statements correspond to the particular case where $\nu$ is
the stationary distribution $\pi$, and particularizing further, $\mu = \delta_j$.)
The proof will be given via the coupling method.⁴ From the discussion preceding
Definition 9.3.5, it suffices to construct two coupling chains with initial
distributions $\mu$ and $\nu$, respectively. This is done in the next lemma.


Lemma 9.3.9 Let $\{X_n^{(1)}\}_{n \ge 0}$ and $\{X_n^{(2)}\}_{n \ge 0}$ be two
independent ergodic HMCs with the same transition matrix $P$ and initial
distributions $\mu$ and $\nu$, respectively. Let
$\tau = \inf\{n \ge 0 \,;\, X_n^{(1)} = X_n^{(2)}\}$, with $\tau = \infty$ if the
chains never intersect. Then $\tau$ is, in fact, almost surely finite. Moreover, the
process $\{X'_n\}_{n \ge 0}$ defined by

$$X'_n = \begin{cases} X_n^{(1)} & \text{if } n \le \tau , \\ X_n^{(2)} & \text{if } n \ge \tau \end{cases} \qquad (9.35)$$

is an HMC with transition matrix $P$.


⁴ For the general theory of coupling and its numerous applications, see [14].

Proof. Step 1. Consider the product HMC $\{Z_n\}_{n \ge 0}$ defined by
$Z_n = (X_n^{(1)}, X_n^{(2)})$. It takes values in $E \times E$, and the probability
of transition from $(i,k)$ to $(j,\ell)$ in $n$ steps is $p_{ij}(n)\, p_{k\ell}(n)$.
We first show that this chain is irreducible. Since $P$ is irreducible and aperiodic,
by Theorem 9.1.17, for all pairs $(i,j)$ and $(k,\ell)$ there exists an $m$ such that
$n \ge m$ implies $p_{ij}(n)\, p_{k\ell}(n) > 0$. This implies irreducibility. (Note
the essential role of aperiodicity. A simple counterexample is that of the symmetric
random walk on $\mathbb{Z}$, which is irreducible but of period 2. The product of two
independent such HMCs is the symmetric random walk on $\mathbb{Z}^2$, which has two
communication classes.)


Step 2. Next we show that the two independent chains meet in finite time.
Clearly, the distribution $\sigma$ defined by $\sigma(i,j) := \pi(i)\pi(j)$ is a
stationary distribution for the product chain, where $\pi$ is the stationary
distribution of $P$. Therefore, by the stationary distribution criterion, the product
chain is positive recurrent. In particular, it reaches the diagonal of $E^2$ in
finite time, and consequently, $P(\tau < \infty) = 1$.


It remains to show that $\{X'_n\}_{n \ge 0}$ given by (9.35) is an HMC with transition
matrix $P$. For this we use the following lemma.


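Lemma 9.3.10 Let $X_0^1, X_0^2, Z_n^1, Z_n^2$ ($n \ge 1$) be independent random
variables, and suppose moreover that $Z_n^1, Z_n^2$ ($n \ge 1$) are identically
distributed. Let $\tau$ be a non-negative integer-valued random variable such that
for all $m \in \mathbb{N}$, the event $\{\tau = m\}$ is expressible in terms of
$X_0^1, X_0^2, Z_n^1, Z_n^2$ ($n \le m$). Define the sequence $\{Z_n\}_{n \ge 1}$ by

$$Z_n = \begin{cases} Z_n^1 & \text{if } n \le \tau , \\ Z_n^2 & \text{if } n > \tau . \end{cases}$$

Then, $\{Z_n\}_{n \ge 1}$ has the same distribution as $\{Z_n^1\}_{n \ge 1}$ and is
independent of $X_0^1, X_0^2$.


Proof. For any sets $C_1, C_2, A_1, \ldots, A_k$ in the appropriate spaces,

$$\begin{aligned}
&P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell \in A_\ell, 1 \le \ell \le k) \\
&= \sum_{m=0}^{k} P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell \in A_\ell, 1 \le \ell \le k, \tau = m) \\
&\quad + P(X_0^1 \in C_1, X_0^2 \in C_2, Z_1 \in A_1, \ldots, Z_k \in A_k, \tau > k) \\
&= \sum_{m=0}^{k} P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le m, \tau = m, Z_r^2 \in A_r, m+1 \le r \le k) \\
&\quad + P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le k, \tau > k) .
\end{aligned}$$

Since the event $\{\tau = m\}$ is independent of
$Z_{m+1}^2 \in A_{m+1}, \ldots, Z_k^2 \in A_k$ ($k > m$), this is equal to

$$\begin{aligned}
&\sum_{m=0}^{k} P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le m, \tau = m)\, P(Z_r^2 \in A_r, m+1 \le r \le k) \\
&\quad + P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le k, \tau > k) \\
&= \sum_{m=0}^{k} P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le m, \tau = m, Z_r^1 \in A_r, m+1 \le r \le k) \\
&\quad + P(X_0^1 \in C_1, X_0^2 \in C_2, Z_\ell^1 \in A_\ell, 1 \le \ell \le k, \tau > k) \\
&= P(X_0^1 \in C_1, X_0^2 \in C_2, Z_1^1 \in A_1, \ldots, Z_k^1 \in A_k) .
\end{aligned}$$


Step 3. We now complete the proof. The statement of the theorem concerns
only the distributions of $\{X_n^\ell\}_{n \ge 0}$ ($\ell = 1, 2$), and therefore we
may assume a representation

$$X_{n+1}^\ell = f(X_n^\ell, Z_{n+1}^\ell) \quad (n \ge 0,\ \ell = 1, 2) ,$$

where $X_0^\ell, Z_n^\ell$ ($n \ge 1$, $\ell = 1, 2$) satisfy the conditions in
Lemma 9.3.10. The random time $\tau$ satisfies the condition of Lemma 9.3.10.
Defining $\{Z_n\}_{n \ge 1}$ in the same manner as in this lemma, we therefore have

$$X'_{n+1} = f(X'_n, Z_{n+1}) ,$$

which proves the announced result.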


9.4 Absorption 


The special nature of the branching process allowed for a simple and elegant
computation of the probability of absorption into state 0. We now consider the
absorption problem for HMCs with no special structure,⁵ based only on the transition
matrix $P$, not necessarily assumed irreducible. The state space $E$ is then
decomposable as $E = T + \sum_j R_j$, where $R_1, R_2, \ldots$ are the disjoint recurrent classes and
$T$ is the collection of transient states. (Note that the number of recurrent classes
as well as the number of transient states may be infinite.) The transition matrix


5 Such as those occurring in sociology, for instance in models describing migration (whether 
geographical or sociological) of populations, for which the transition matrix is obtained empiri- 
cally. 
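can therefore be block-partitioned as

\[
P = \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ B(1) & B(2) & \cdots & Q \end{pmatrix}
\]

or in condensed notation,

\[
P = \begin{pmatrix} D & 0 \\ B & Q \end{pmatrix}. \tag{9.36}
\]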


This structure of the transition matrix accounts for the fact that one cannot go 
from a state in a given recurrent class to any state not belonging to this recurrent 
class. In other words, a recurrent class is closed. 


What is the probability of being absorbed by a given recurrent class when 
starting from a given transient state? This kind of problem was already addressed 
when the first-step analysis method was introduced. This method leads to a sys- 
tem of linear equations with boundary conditions, for which the solution is unique, 
due to the finiteness of the state space. With an infinite state space, the unique- 
ness issue cannot be overlooked, and the absorption problem will be reconsidered 
with this in mind, and also with the intention of finding general matrix-algebraic 
expressions for the solutions. Another phenomenon not manifesting itself in the 
finite case is the possibility, when the set of transient states is infinite, of never 
being absorbed by the recurrent set. We shall consider this problem first, and then 
proceed to derive the distribution of the time to absorption by the recurrent set, 
and the probability of being absorbed by a given recurrent class. 


Before Absorption 


Let $A$ be a subset of the state space $E$ (typically the set of transient states, but
not necessarily). We aim at computing for any initial state $i \in A$ the probability
of remaining forever in $A$,

\[
v(i) = P_i(X_r \in A;\ r \geq 0).
\]

Defining $v_n(i) := P_i(X_1 \in A, \ldots, X_n \in A)$, we have, by monotone sequential
continuity,

\[
\lim_{n \uparrow \infty} \downarrow v_n(i) = v(i).
\]


But for $j \in A$,

\[
P_i(X_1 \in A, \ldots, X_{n-1} \in A, X_n = j) = \sum_{i_1 \in A} \cdots \sum_{i_{n-1} \in A} p_{i i_1} \cdots p_{i_{n-1} j}
\]

is the general term $q_{ij}(n)$ of the $n$-th iterate of the restriction $Q$ of $P$ to the set
$A$. Therefore $v_n(i) = \sum_{j \in A} q_{ij}(n)$, that is, in vector notation,

\[
v_n = Q^n 1_A,
\]

where $1_A$ is the column vector indexed by $A$ with all entries equal to 1. From this
equality we obtain

\[
v_{n+1} = Q v_n,
\]

and by dominated convergence $v = Qv$. Moreover, $0_A \leq v \leq 1_A$, where $0_A$ is the
column vector indexed by $A$ with all entries equal to 0. The above result can be
refined as follows:


Theorem 9.4.1 The vector $v$ is the maximal solution of

\[
v = Qv, \qquad 0_A \leq v \leq 1_A.
\]

Moreover, either $v = 0_A$ or $\sup_{i \in A} v(i) = 1$. In the case of a finite transient set $T$,
the probability of infinite sojourn in $T$ is null.


Proof. Only maximality and the last statement remain to be proved. To prove
maximality, consider a vector $u$ indexed by $A$ such that $u = Qu$ and $0_A \leq u \leq 1_A$.
Iteration of $u = Qu$ yields $u = Q^n u$, and $u \leq 1_A$ implies that $Q^n u \leq Q^n 1_A = v_n$.
Therefore $u \leq v_n$, which gives $u \leq v$ by passage to the limit.


To prove the last statement of the theorem, let $c = \sup_{i \in A} v(i)$. From $v \leq c 1_A$,
we obtain $v \leq c v_n$, as above, and therefore, at the limit, $v \leq cv$. This implies either
$v = 0_A$ or $c = 1$.


When the set T is finite, the probability of infinite sojourn in T is null, because 
otherwise at least one transient state would be visited infinitely often. 


Equation $v = Qv$ reads

\[
v(i) = \sum_{j \in A} p_{ij} v(j) \qquad (i \in A).
\]


First-step analysis gives this equality as a necessary condition. However, it does 
not help to determine which solution to choose, in case there are several. 


EXAMPLE 9.4.2: REPAIR SHOP, TAKE 5. We shall prove in a different way a
result already obtained previously, that is: the repair shop chain is recurrent if
and only if $\rho \leq 1$. Observe that the restriction of $P$ to $A_i := \{i+1, i+2, \ldots\}$,
namely

\[
Q = \begin{pmatrix} a_1 & a_2 & a_3 & \cdots \\ a_0 & a_1 & a_2 & \cdots \\ & a_0 & a_1 & \cdots \\ & & \ddots & \ddots \end{pmatrix},
\]

does not depend on $i \geq 0$. In particular, the maximal solution of $v = Qv$, $0_A \leq v \leq 1_A$
when $A = A_i$ has, in view of Theorem 9.4.1, the following two interpretations.
Firstly, for $i \geq 1$, $1 - v(i)$ is the probability of visiting 0 when starting from $i$.
Secondly, $1 - v(1)$ is the probability of visiting $\{0, 1, \ldots, i\}$ when starting from
$i + 1$. But when starting from $i + 1$, the chain visits $\{0, 1, \ldots, i\}$ if and only if it
visits $i$, and therefore $1 - v(1)$ is also the probability of visiting $i$ when starting
from $i + 1$. The probability of visiting 0 when starting from $i + 1$ is

\[
1 - v(i+1) = (1 - v(1))(1 - v(i)),
\]

because in order to go from $i + 1$ to 0 one must first reach $i$, and then go to 0.
Therefore, for all $i \geq 1$,

\[
v(i) = 1 - \beta^i,
\]

where $\beta = 1 - v(1)$. To determine $\beta$, write the first equality of $v = Qv$:

\[
v(1) = a_1 v(1) + a_2 v(2) + \cdots,
\]

that is,

\[
(1 - \beta) = a_1 (1 - \beta) + a_2 (1 - \beta^2) + \cdots.
\]

Since $\sum_{i \geq 0} a_i = 1$, this reduces to

\[
\beta = g(\beta), \tag{$\star$}
\]

where $g$ is the generating function of the probability distribution $(a_k, k \geq 0)$. Also,
all other equations of $v = Qv$ reduce to ($\star$).


Under the irreducibility assumptions $a_0 > 0$, $a_0 + a_1 < 1$, ($\star$) has only one
solution in $[0,1]$, namely $\beta = 1$, if $\rho \leq 1$, whereas if $\rho > 1$, it has two solutions in
$[0,1]$, namely $\beta = 1$ and $\beta = \beta_0 \in (0,1)$. Since $v$ is the maximal solution, we must
take the smallest value of $\beta$. Therefore, if $\rho > 1$, the probability of visiting state 0 when starting from
state $i \geq 1$ is $1 - v(i) = \beta_0^i < 1$, and therefore the chain is transient. If $\rho \leq 1$, the
latter probability is $1 - v(i) = 1$, and therefore the chain is recurrent.
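
As a numerical illustration of the rule "take the smallest solution", the following sketch iterates $\beta \mapsto g(\beta)$ starting from $\beta = 0$. Because $g$ is non-decreasing and continuous on $[0,1]$, this iteration increases to the smallest fixed point of $g$ in $[0,1]$. The Poisson arrival distribution, for which $g(s) = e^{\rho(s-1)}$, and the names `smallest_fixed_point` and `poisson_pgf` are assumptions made for the example only; they are not part of the repair shop model as stated above.

```python
import math

def smallest_fixed_point(g, tol=1e-12, max_iter=100_000):
    """Iterate beta <- g(beta) from 0; the limit is the smallest root of beta = g(beta) in [0, 1]."""
    beta = 0.0
    for _ in range(max_iter):
        nxt = g(beta)
        if abs(nxt - beta) < tol:
            return nxt
        beta = nxt
    return beta

def poisson_pgf(rho):
    # Hypothetical choice: a_k = e^{-rho} rho^k / k!, so that g(s) = exp(rho (s - 1)).
    return lambda s: math.exp(rho * (s - 1.0))

beta_transient = smallest_fixed_point(poisson_pgf(2.0))   # rho > 1: a root in (0, 1)
beta_recurrent = smallest_fixed_point(poisson_pgf(0.5))   # rho < 1: the only root is 1
```

With $\rho = 2$ the iteration stops at a root strictly inside $(0,1)$ (transient case), whereas with $\rho = 0.5$ it climbs to 1 (recurrent case).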
EXAMPLE 9.4.3: 1-D RANDOM WALK, TAKE 3. The transition matrix of the
random walk on $\mathbb{N}$ with a reflecting barrier at 0,

\[
P = \begin{pmatrix} 0 & 1 & & & \\ q & 0 & p & & \\ & q & 0 & p & \\ & & \ddots & \ddots & \ddots \end{pmatrix},
\]

where $p \in (0,1)$, is clearly irreducible. Intuitively, if $p > q$, there is a drift to the
right, and one expects the chain to be transient. This will be proved formally by
showing that the probability $v(i)$ of never visiting state 0 when starting from state
$i \geq 1$ is strictly positive. In order to apply Theorem 9.4.1 with $A = \mathbb{N} - \{0\}$, we
must find the general solution of $u = Qu$. This equation reads

\[
\begin{aligned}
u(1) &= p\,u(2), \\
u(2) &= q\,u(1) + p\,u(3), \\
u(3) &= q\,u(2) + p\,u(4), \\
&\ \ \vdots
\end{aligned}
\]

and its general solution is $u(i) = u(1) \sum_{j=0}^{i-1} (q/p)^j$. The largest value of $u(1)$
respecting the constraint $u(i) \in [0,1]$ is $u(1) = 1 - q/p$. The solution $v(i)$ is therefore

\[
v(i) = 1 - \left(\frac{q}{p}\right)^i.
\]
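
The closed form $v(i) = 1 - (q/p)^i$ can be checked numerically through the iteration $v_{n+1} = Qv_n$, $v_0 = 1_A$, that underlies Theorem 9.4.1. The sketch below is only an illustration: the function name, the value $p = 0.7$ and the number of iterations are assumptions. Since $v_n(i)$ involves only the states up to $i + n$, the computation is exact on a finite window that loses one state per iteration.

```python
def survival_probabilities(p, n_steps, n_states):
    """Return [v_n(1), ..., v_n(n_states)], where v_n(i) = P_i(X_1, ..., X_n all differ from 0)."""
    q = 1.0 - p
    # v[i] holds v_k(i); v[0] = 0 because state 0 lies outside A = {1, 2, ...}.
    v = [0.0] + [1.0] * (n_states + n_steps)
    for _ in range(n_steps):
        # v_{k+1}(i) = q v_k(i - 1) + p v_k(i + 1); the last index is dropped at each round.
        v = [0.0] + [q * v[i - 1] + p * v[i + 1] for i in range(1, len(v) - 1)]
    return v[1:]

v_400 = survival_probabilities(p=0.7, n_steps=400, n_states=5)
closed_form = [1.0 - (0.3 / 0.7) ** i for i in range(1, 6)]
```

For $p > q$ the sequence $v_n(i)$ decreases quickly to its limit, so `v_400` and `closed_form` should agree to many decimal places.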


Time to Absorption 


We now turn to the determination of the distribution of $\tau$, the time of exit from the
transient set $T$. Theorem 9.4.1 says that $v = \{v(i)\}_{i \in T}$, where $v(i) = P_i(\tau = \infty)$,
is the largest solution of $v = Qv$ subject to the constraints $0_T \leq v \leq 1_T$, where
$Q$ is the restriction of $P$ to the transient set $T$. The probability distribution of $\tau$
when the initial state is $i \in T$ is readily computed starting from the identity

\[
P_i(\tau = n) = P_i(\tau \geq n) - P_i(\tau \geq n+1)
\]

and the observation that for $n \geq 1$, $\{\tau \geq n\} = \{X_{n-1} \in T\}$, from which we obtain,
for $n \geq 1$,

\[
P_i(\tau = n) = P_i(X_{n-1} \in T) - P_i(X_n \in T) = \sum_{j \in T} (p_{ij}(n-1) - p_{ij}(n)).
\]

Now, $p_{ij}(n)$ ($i, j \in T$) is the general term of the matrix $Q^n$, and therefore:

Theorem 9.4.4

\[
P_i(\tau = n) = \{(Q^{n-1} - Q^n) 1_T\}_i. \tag{9.37}
\]

In particular, if $P_i(\tau = \infty) = 0$,

\[
P_i(\tau > n) = \{Q^n 1_T\}_i.
\]

Proof. Only the last statement remains to be proved. From (9.37),

\[
\begin{aligned}
P_i(n < \tau \leq n+m) &= \sum_{j=0}^{m-1} \{(Q^{n+j} - Q^{n+j+1}) 1_T\}_i \\
&= \{(Q^n - Q^{n+m}) 1_T\}_i,
\end{aligned}
\]

and therefore, if $P_i(\tau = \infty) = 0$, we obtain the last statement of the theorem by letting $m \uparrow \infty$.


Final Destination 


We seek to compute the probability of absorption by a given recurrent class when
starting from a given transient state. As we shall see later, it suffices for the theory
to treat the case where the recurrent classes are singletons. We therefore suppose
that the transition matrix has the form

\[
P = \begin{pmatrix} I & 0 \\ B & Q \end{pmatrix}. \tag{9.38}
\]

Let $f_{ij}$ be the probability of absorption by recurrent class $R_j = \{j\}$ when starting
from the transient state $i$. We have

\[
P^n = \begin{pmatrix} I & 0 \\ L_n & Q^n \end{pmatrix},
\]

where $L_n = (I + Q + \cdots + Q^{n-1})B$. Therefore, $\lim_{n \uparrow \infty} L_n = SB$. For $i \in T$, the
$(i, j)$ term of $L_n$ is

\[
L_n(i, j) = P(X_n = j \mid X_0 = i).
\]

Now, if $T_{R_j}$ is the first time of visit to $R_j$ after time 0, then

\[
L_n(i, j) = P_i(T_{R_j} \leq n),
\]

since $R_j$ is a closed state. Letting $n$ go to $\infty$ gives the following:


Theorem 9.4.5 For an HMC with transition matrix $P$ of the form (9.38), the
probability of absorption by recurrent class $R_j = \{j\}$ starting from transient state
$i$ is

\[
P_i(T_{R_j} < \infty) = (SB)_{i, R_j}.
\]
The general case, where the recurrent classes are not necessarily singletons,
can be reduced to the singleton case as follows. Let $P^*$ be the matrix obtained
from the transition matrix $P$ by grouping, for each $j$, the states of recurrent class
$R_j$ into a single state $\hat{j}$:

\[
P^* = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ b_1 & b_2 & \cdots & Q \end{pmatrix}, \tag{9.39}
\]

where $b_j = B(j) 1_T$ is obtained by summation of the columns of $B(j)$, the matrix
consisting of the columns $i \in R_j$ of $B$. The probability $f_{iR_j}$ of absorption by class
$R_j$ when starting from $i \in T$ equals $\hat{f}_{i\hat{j}}$, the probability of ever visiting $\hat{j}$ when
starting from $i$, computed for the chain with transition matrix $P^*$.


EXAMPLE 9.4.6: SIBMATING. In the reproduction model called sibmating (sister- 
brother mating), two individuals are mated and two individuals from their offspring 
are chosen at random to be mated, and this incestuous process goes on through 
the subsequent generations. 


Denote by $X_n$ the genetic type of the mating pair at the $n$th generation. Clearly,
$\{X_n\}_{n \geq 0}$ is an HMC with six states representing the different pairs of genotypes
AA × AA, aa × aa, AA × Aa, Aa × Aa, Aa × aa, AA × aa, denoted respectively
1, 2, 3, 4, 5, 6. The following table gives the probabilities of occurrence of the
three possible genotypes in the descent of a mating pair:


                        descendant's genotype
parents' genotype       AA      Aa      aa
AA × AA                 1       0       0
aa × aa                 0       0       1
AA × Aa                 1/2     1/2     0
Aa × Aa                 1/4     1/2     1/4
Aa × aa                 0       1/2     1/2
AA × aa                 0       1       0


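The transition matrix of $\{X_n\}_{n \geq 0}$ is then easily deduced:

\[
P = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
1/4 & 0 & 1/2 & 1/4 & 0 & 0 \\
1/16 & 1/16 & 1/4 & 1/4 & 1/4 & 1/8 \\
0 & 1/4 & 0 & 1/4 & 1/2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}.
\]

The set $R = \{1, 2\}$ is absorbing, and the restriction of the transition matrix to the
transient set $T = \{3, 4, 5, 6\}$ is

\[
Q = \begin{pmatrix}
1/2 & 1/4 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/8 \\
0 & 1/4 & 1/2 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}.
\]

We find

\[
S = (I - Q)^{-1} = \frac{1}{6} \begin{pmatrix}
16 & 8 & 4 & 1 \\
8 & 16 & 8 & 2 \\
4 & 8 & 16 & 1 \\
8 & 16 & 8 & 8
\end{pmatrix},
\]

and the absorption probability matrix is

\[
SB = S \begin{pmatrix}
1/4 & 0 \\
1/16 & 1/16 \\
0 & 1/4 \\
0 & 0
\end{pmatrix}
= \begin{pmatrix}
3/4 & 1/4 \\
1/2 & 1/2 \\
1/4 & 3/4 \\
1/2 & 1/2
\end{pmatrix}.
\]

For instance, the (3, 2) entry, 3/4, is the probability that when starting from a couple
of ancestors of type Aa × aa, the race will end up in genotype aa × aa.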
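
The absorption probabilities of the sibmating example can be recomputed exactly by solving the linear system $(I - Q)X = B$, whose solution is $X = SB$. The sketch below uses rational arithmetic from the Python standard library; the helper `solve` is a plain Gauss-Jordan elimination written for this illustration and is an assumption of the example, not a method described in the text.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (b may have several columns)."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]   # augmented matrix [a | b]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]            # normalize the pivot row
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]

# Restriction Q of P to the transient set T = {3, 4, 5, 6}, and block B (columns: states 1, 2).
Q = [[F(1, 2), F(1, 4), F(0), F(0)],
     [F(1, 4), F(1, 4), F(1, 4), F(1, 8)],
     [F(0), F(1, 4), F(1, 2), F(0)],
     [F(0), F(1), F(0), F(0)]]
B = [[F(1, 4), F(0)],
     [F(1, 16), F(1, 16)],
     [F(0), F(1, 4)],
     [F(0), F(0)]]

I_minus_Q = [[F(int(i == j)) - Q[i][j] for j in range(4)] for i in range(4)]
absorption = solve(I_minus_Q, B)   # rows: starting states 3, 4, 5, 6
```

Each row of `absorption` sums to 1, in agreement with the fact that a finite transient set is left with probability 1.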


9.5 The Markov Property on Graphs 


This section introduces the Markov fields on a graph, a notion of special interest 
in Physics. 


Let $G = (V, \mathcal{E})$ be a finite graph, and let $v_1 \sim v_2$ denote the fact that $(v_1, v_2)$ is
an edge of the graph.⁶ Such vertices are called neighbors (one of the other). One
sometimes refers to vertices of $V$ as sites. The boundary with respect to $\sim$ of a set
$A \subset V$ is the set

\[
\partial A := \{v \in V \backslash A;\ v \sim w \text{ for some } w \in A\}.
\]

Let $\Lambda$ be a finite set, called the phase space. A random field on $V$ with phases in
$\Lambda$ is a collection $X = \{X(v)\}_{v \in V}$ of random variables with values in $\Lambda$. A random
field can be regarded as a random variable taking its values in the configuration


6 Recall that, in the definition of an edge (v1, v2), v1 and v2 are distinct vertices. 
space $E := \Lambda^V$, where a configuration is a function $x : v \in V \mapsto x(v) \in \Lambda$. For a
given configuration $x$ and a given subset $A \subset V$, let

\[
x(A) := (x(v), v \in A)
\]

denote the restriction of $x$ to $A$. If $V \backslash A$ denotes the complement of $A$ in $V$, one
writes $x = (x(A), x(V \backslash A))$. In particular, for fixed $v \in V$, $x = (x(v), x(V \backslash v))$,
where $V \backslash v$ is a shorter way of writing $V \backslash \{v\}$, the complement of the singleton $\{v\}$
in $V$.


Of special interest are the random fields characterized by local interactions.
This leads to the notion of a Markov random field. The “locality” is in terms of
the neighborhood structure inherited from the graph structure. More precisely,
for any $v \in V$, $\mathcal{N}_v := \{w \in V;\ w \sim v\}$ is the neighborhood of $v$. In the following,
$\widetilde{\mathcal{N}}_v$ denotes the set $\mathcal{N}_v \cup \{v\}$.


Definition 9.5.1 The random field $X$ is called a Markov random field (MRF)
with respect to $\sim$ if for all sites $v \in V$, the random elements $X(v)$ and $X(V \backslash \widetilde{\mathcal{N}}_v)$
are independent given $X(\mathcal{N}_v)$.


In symbols:

\[
P(X(v) = x(v) \mid X(V \backslash v) = x(V \backslash v)) = P(X(v) = x(v) \mid X(\mathcal{N}_v) = x(\mathcal{N}_v)) \tag{9.40}
\]

for all $x \in \Lambda^V$ and all $v \in V$. Property (9.40) is of the Markov type in the sense
that the distribution of the phase at a given site is directly influenced only by the
phases of the neighboring sites.


Note that any random field is Markovian with respect to the trivial topology, 
where the neighborhood of any site v is V\v. However, the interesting Markov 
fields (from the point of view of modeling, simulation and optimization) are those 
with relatively small neighborhoods. 


EXAMPLE 9.5.2: MARKOV CHAIN AS MARKOV FIELD. The Markov property
of a stochastic sequence $\{X_n\}_{n \geq 0}$ implies (Exercise 9.7.18) that for all $n \geq 1$, $X_n$ is
independent of $(X_k, k \notin \{n-1, n, n+1\})$ given $(X_{n-1}, X_{n+1})$. Calling $n$ a vertex,
$X_n$ the value of the process at vertex $n$ and the set $\{n-1, n+1\}$ the neighborhood
of vertex $n$, the above property can be rephrased as: For all $n \geq 1$, the value at
vertex $n$ is independent of the values at vertices $k \notin \{n-1, n, n+1\}$ given the
values in the neighborhood of vertex $n$.
Definition 9.5.3 The local characteristic of the MRF at site $v$ is the function
$\pi^v : \Lambda^V \to [0,1]$ defined by

\[
\pi^v(x) := P(X(v) = x(v) \mid X(\mathcal{N}_v) = x(\mathcal{N}_v)).
\]

The family $\{\pi^v\}_{v \in V}$ is called the local specification of the MRF.


One sometimes writes $\pi^v(x) := \pi(x(v) \mid x(\mathcal{N}_v))$.


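Theorem 9.5.4 Two positive distributions of a random field with a finite
configuration space $\Lambda^V$ that have the same local specification are identical.


Proof. Enumerate $V$ as $\{1, 2, \ldots, K\}$. Therefore a configuration $x \in \Lambda^V$ is
represented as $x = (x_1, \ldots, x_{K-1}, x_K)$ where $x_i \in \Lambda$ ($1 \leq i \leq K$). The following
identity

\[
\pi(z_1, z_2, \ldots, z_K) = \prod_{i=1}^{K} \frac{\pi(z_i \mid z_1, \ldots, z_{i-1}, y_{i+1}, \ldots, y_K)}{\pi(y_i \mid z_1, \ldots, z_{i-1}, y_{i+1}, \ldots, y_K)}\, \pi(y_1, y_2, \ldots, y_K) \tag{$\star$}
\]

holds for any $z, y \in \Lambda^K$. For the proof, write

\[
\pi(z) = \prod_{i=1}^{K} \frac{\pi(z_1, \ldots, z_{i-1}, z_i, y_{i+1}, \ldots, y_K)}{\pi(z_1, \ldots, z_{i-1}, y_i, y_{i+1}, \ldots, y_K)}\, \pi(y)
\]

and use Bayes' rule to obtain for each $i$ ($1 \leq i \leq K$):

\[
\frac{\pi(z_1, \ldots, z_{i-1}, z_i, y_{i+1}, \ldots, y_K)}{\pi(z_1, \ldots, z_{i-1}, y_i, y_{i+1}, \ldots, y_K)} = \frac{\pi(z_i \mid z_1, \ldots, z_{i-1}, y_{i+1}, \ldots, y_K)}{\pi(y_i \mid z_1, \ldots, z_{i-1}, y_{i+1}, \ldots, y_K)}.
\]

Let now $\pi$ and $\pi'$ be two positive probability distributions on $\Lambda^V$ with the same
local specification. Choose any $y \in \Lambda^V$. Identity ($\star$) shows that for all $z \in \Lambda^V$,

\[
\frac{\pi'(z)}{\pi(z)} = \frac{\pi'(y)}{\pi(y)}.
\]

Therefore $\frac{\pi'(z)}{\pi(z)}$ is a constant, necessarily equal to 1 since $\pi$ and $\pi'$ are probability
distributions.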


Gibbs Distributions 


Consider the probability distribution

\[
\pi_T(x) = \frac{1}{Z_T} e^{-\frac{1}{T} U(x)} \tag{9.41}
\]

on the configuration space $\Lambda^V$, where $T > 0$ is a “temperature”, $U(x)$ is the
“energy” of configuration $x$ and $Z_T$ is the normalizing constant, called the partition
function. Since $\pi_T(x)$ takes its values in $[0,1]$, necessarily $-\infty < U(x) \leq +\infty$.
Note that $U(x) < +\infty$ if and only if $\pi_T(x) > 0$. One of the challenges associated
with Gibbs models is obtaining explicit formulas for averages, considering that it 
is generally hard to compute the partition function. (This is however feasible in 
exceptional cases; see Exercise 9.7.19.) 


Such distributions are of interest to physicists when the energy is expressed 
in terms of a potential function describing the local interactions. The notion of 
clique then plays a central role. 


Definition 9.5.5 Any singleton $\{v\} \subset V$ is a clique. A subset $C \subset V$ with more
than one element is called a clique (with respect to $\sim$) if and only if any two
distinct sites of $C$ are mutual neighbors. A clique $C$ is called maximal if for any
site $v \notin C$, $C \cup \{v\}$ is not a clique.


The collection of cliques will be denoted by $\mathcal{C}$.


Definition 9.5.6 A Gibbs potential on $\Lambda^V$ relative to $\sim$ is a collection $\{V_C\}_{C \subset V}$
of functions $V_C : \Lambda^V \to \mathbb{R} \cup \{+\infty\}$ such that


(i) $V_C \equiv 0$ if $C$ is not a clique, and
(ii) for all $x, x' \in \Lambda^V$ and all $C \subset V$,

\[
x(C) = x'(C) \Rightarrow V_C(x) = V_C(x').
\]

The energy function $U$ is said to derive from the potential $\{V_C\}_{C \subset V}$ if

\[
U(x) = \sum_{C} V_C(x).
\]

The function $V_C$ depends only on the phases at the sites inside subset $C$. One
could write more explicitly $V_C(x(C))$ instead of $V_C(x)$, but this notation will not
be used.


In this context, the distribution in (9.41) is called a Gibbs distribution (with 
respect to ~). 


EXAMPLE 9.5.7: ISING MODEL, TAKE 1. In statistical physics, the following
model is regarded as a qualitatively correct idealization of a piece of ferromagnetic
material. Here $V = \mathbb{Z}_m^2 = \{(i,j) \in \mathbb{Z}^2;\ 1 \leq i, j \leq m\}$ and $\Lambda = \{+1, -1\}$,
where $\pm 1$ is the orientation of the magnetic spin at a given site. The neighborhood of
a site consists of its four closest sites. The Gibbs potential is

\[
\begin{aligned}
V_{\{v\}}(x) &= -\frac{H}{k}\, x(v), \\
V_{\langle v, w \rangle}(x) &= -\frac{J}{k}\, x(v) x(w),
\end{aligned}
\]

where $\langle v, w \rangle$ is the 2-element clique ($v \sim w$). For physicists, $k$ is the Boltzmann
constant, $H$ is the external magnetic field, and $J$ is the internal energy of an
elementary magnetic dipole. The energy function corresponding to this potential
is therefore

\[
U(x) = -\frac{J}{k} \sum_{\langle v, w \rangle} x(v) x(w) - \frac{H}{k} \sum_{v \in V} x(v).
\]

The Hammersley—Clifford Theorem 


Gibbs distributions with an energy deriving from a Gibbs potential relative to 
a neighborhood system are distributions of Markov fields relative to the same 
neighborhood system. 


Theorem 9.5.8 If $X$ is a random field with a distribution $\pi$ of the form $\pi(x) = \frac{1}{Z} e^{-U(x)}$,
where the energy function $U$ derives from a Gibbs potential $\{V_C\}_{C \subset V}$
relative to $\sim$, then $X$ is a Markov random field with respect to $\sim$. Moreover, its
local specification is given by the formula

\[
\pi^v(x) = \frac{e^{-\sum_{C \ni v} V_C(x)}}{\sum_{\lambda \in \Lambda} e^{-\sum_{C \ni v} V_C(\lambda, x(V \backslash v))}}, \tag{9.42}
\]

where the notation $\sum_{C \ni v}$ means that the sum extends over the sets $C$ that contain
the site $v$.

Proof. First observe that the right-hand side of (9.42) depends on $x$ only through
$x(v)$ and $x(\mathcal{N}_v)$. Indeed, $V_C(x)$ depends only on $(x(w), w \in C)$, and for a clique
$C$, if $w \in C$ and $v \in C$, then either $w = v$ or $w \sim v$. Therefore, if it can be shown
that $P(X(v) = x(v) \mid X(V \backslash v) = x(V \backslash v))$ equals the right-hand side of (9.42), then
(Theorem 2.1.14) the Markov property is proved. By definition of conditional
probability,

\[
P(X(v) = x(v) \mid X(V \backslash v) = x(V \backslash v)) = \frac{\pi(x)}{\sum_{\lambda \in \Lambda} \pi(\lambda, x(V \backslash v))}. \tag{$\dagger$}
\]

But

\[
\pi(x) = \frac{1}{Z} e^{-\sum_{C \ni v} V_C(x) - \sum_{C \not\ni v} V_C(x)}
\]

and similarly,

\[
\pi(\lambda, x(V \backslash v)) = \frac{1}{Z} e^{-\sum_{C \ni v} V_C(\lambda, x(V \backslash v)) - \sum_{C \not\ni v} V_C(\lambda, x(V \backslash v))}.
\]

If $C$ is a clique and $v$ is not in $C$, then $V_C(\lambda, x(V \backslash v)) = V_C(x)$ and is therefore
independent of $\lambda \in \Lambda$. Therefore, after factoring out $\exp\{-\sum_{C \not\ni v} V_C(x)\}$, the
right-hand side of ($\dagger$) is found to be equal to the right-hand side of (9.42).


The local energy at site $v$ of configuration $x$ is

\[
U_v(x) = \sum_{C \ni v} V_C(x).
\]

With this notation, (9.42) becomes

\[
\pi^v(x) = \frac{e^{-U_v(x)}}{\sum_{\lambda \in \Lambda} e^{-U_v(\lambda, x(V \backslash v))}}.
\]


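EXAMPLE 9.5.9: ISING MODEL, TAKE 2. The local characteristics in the Ising
model are

\[
\pi_T^v(x) = \frac{e^{\frac{1}{kT}\left\{J \sum_{w;\, w \sim v} x(w) + H\right\} x(v)}}{e^{+\frac{1}{kT}\left\{J \sum_{w;\, w \sim v} x(w) + H\right\}} + e^{-\frac{1}{kT}\left\{J \sum_{w;\, w \sim v} x(w) + H\right\}}}.
\]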


Theorem 9.5.8 above is the direct part of the Gibbs—Markov equivalence the- 
orem: A Gibbs distribution relative to a neighborhood system is the distribution 
of a Markov field with respect to the same neighborhood system. The converse 
part (Hammersley—Clifford theorem) is important from a theoretical point of view, 
since together with the direct part it concludes that Gibbs distributions and MRFs 
are essentially the same objects. 


Theorem 9.5.10 Let $\pi > 0$ be the distribution of a Markov random field with
respect to $\sim$. Then

\[
\pi(x) = \frac{1}{Z} e^{-U(x)}
\]

for some energy function $U$ deriving from a Gibbs potential $\{V_C\}_{C \subset V}$ with respect
to $\sim$.

The proof is omitted,⁷ since in practice, the potential as well as the topology


of V can be obtained directly from the expression of the energy, as the following 
example shows. 


7 See for instance Theorem 10.1.11 of [4].
EXAMPLE 9.5.11: MARKOV CHAINS AS MARKOV FIELDS. Let $V = \{0, 1, \ldots, N\}$
and $\Lambda = E$, a finite space. A random field $X$ on $V$ with phase space $\Lambda$ is therefore a
vector $X$ with values in $E^{N+1}$. Suppose that $X_0, \ldots, X_N$ is a homogeneous Markov
chain with transition matrix $P = \{p_{ij}\}_{i,j \in E}$ and initial distribution $\nu = \{\nu_i\}_{i \in E}$.
In particular, with $x = (x_0, \ldots, x_N)$,

\[
\pi(x) = \nu_{x_0}\, p_{x_0 x_1} \cdots p_{x_{N-1} x_N},
\]

that is,

\[
\pi(x) = e^{-U(x)},
\]

where

\[
U(x) = -\log \nu_{x_0} - \sum_{n=0}^{N-1} \log p_{x_n x_{n+1}}.
\]

Clearly, this energy derives from a Gibbs potential associated with the
nearest-neighbor topology for which the cliques are, besides the singletons, the pairs of
adjacent sites. The potential functions are:

\[
V_{\{0\}}(x) = -\log \nu_{x_0}, \qquad V_{\{n, n+1\}}(x) = -\log p_{x_n x_{n+1}}.
\]

The local characteristic at site $n$, $2 \leq n \leq N-1$, can be computed from formula
(9.42), which gives

\[
\pi^n(x) = \frac{\exp(\log p_{x_{n-1} x_n} + \log p_{x_n x_{n+1}})}{\sum_{y \in E} \exp(\log p_{x_{n-1} y} + \log p_{y x_{n+1}})},
\]

that is,

\[
\pi^n(x) = \frac{p_{x_{n-1} x_n}\, p_{x_n x_{n+1}}}{p^{(2)}_{x_{n-1} x_{n+1}}},
\]

where $p^{(2)}_{ij}$ is the general term of the two-step transition matrix $P^2$. Similar
computations give $\pi^0(x)$ and $\pi^N(x)$. We note that, in view of the neighborhood structure,
for $2 \leq n \leq N-1$, $X_n$ is independent of $X_0, \ldots, X_{n-2}, X_{n+2}, \ldots, X_N$ given $X_{n-1}$
and $X_{n+1}$.
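
The formula for the local characteristic of a Markov chain can be checked by brute force on a small case. In the sketch below, the transition matrix, the initial distribution and the length $N$ are hypothetical values chosen for illustration; the conditional probability of $X_n$ given all the other coordinates is computed directly from the joint distribution and compared with $p_{x_{n-1}x_n} p_{x_n x_{n+1}} / p^{(2)}_{x_{n-1}x_{n+1}}$ at the interior sites.

```python
from itertools import product

P = [[0.5, 0.3, 0.2],
     [0.25, 0.5, 0.25],
     [0.1, 0.4, 0.5]]           # hypothetical positive transition matrix
nu = [0.2, 0.3, 0.5]            # hypothetical initial distribution
N = 4                           # sites 0, ..., N

def joint(x):
    """pi(x) = nu_{x_0} p_{x_0 x_1} ... p_{x_{N-1} x_N}."""
    prob = nu[x[0]]
    for a, b in zip(x, x[1:]):
        prob *= P[a][b]
    return prob

def local_characteristic_brute(n, x):
    """P(X_n = x_n | all other coordinates), computed from the joint distribution."""
    total = sum(joint(x[:n] + (y,) + x[n + 1:]) for y in range(len(P)))
    return joint(x) / total

def local_characteristic_formula(n, x):
    """p_{x_{n-1} x_n} p_{x_n x_{n+1}} divided by the two-step probability from x_{n-1} to x_{n+1}."""
    two_step = sum(P[x[n - 1]][y] * P[y][x[n + 1]] for y in range(len(P)))
    return P[x[n - 1]][x[n]] * P[x[n]][x[n + 1]] / two_step

max_gap = max(
    abs(local_characteristic_brute(n, x) - local_characteristic_formula(n, x))
    for x in product(range(len(P)), repeat=N + 1)
    for n in range(1, N)        # interior sites 1, ..., N - 1
)
```

The two computations should agree up to floating-point round-off, so `max_gap` is expected to be of the order of machine precision.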


9.6 Monte Carlo Markov Chains 


Let us return to the problem of generating a random variable with a given distri- 
bution (see Section 3.2). 
Both the inverse method and the acceptance-rejection method apply in principle
when $Z$ is a discrete random variable with values in a finite space $E = \{1, 2, \ldots, r\}$.
Denote by $\pi$ the distribution of $Z$. The inverse method is in this
case always theoretically feasible. It consists in generating a random variable $U$
uniformly distributed on $[0,1]$ and letting $Z = i$ if and only if
$\sum_{\ell=1}^{i-1} \pi(\ell) \leq U < \sum_{\ell=1}^{i} \pi(\ell)$. When the size $r$ of the state space $E$ is large, problems arise that are
due to the small size of the intervals partitioning $[0,1]$ and to the cost of precision
in computing.
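
The inverse method for a finite state space can be sketched as follows. The function name `inverse_method` and the three-point distribution are assumptions made for the illustration; the partial sums are searched with the standard-library `bisect` module.

```python
from bisect import bisect_right
from itertools import accumulate
import random

def inverse_method(pi, u):
    """Return the state i in {1, ..., r} such that sum_{l<i} pi(l) <= u < sum_{l<=i} pi(l)."""
    cumulative = list(accumulate(pi))
    index = bisect_right(cumulative, u)          # number of partial sums that are <= u
    return min(index, len(pi) - 1) + 1           # guard against round-off in the last partial sum

pi = [0.2, 0.5, 0.3]                             # hypothetical target distribution
rng = random.Random(1)
sample = [inverse_method(pi, rng.random()) for _ in range(10)]
```

The guard on the last index illustrates, on a small scale, the precision issue mentioned above: with many states the intervals become very short and round-off in the partial sums starts to matter.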


Another difficulty with the classical methods, besides the usual round-off errors,
is that the probability $\pi$ is in important applications known only up to a
normalizing factor, that is, $\pi = K \tilde{\pi}$, and then, the integral that gives the normalizing
factor $K$ is difficult or impossible to compute. In physics, this is frequently the
case, because the partition function of a Gibbs distribution is usually unavailable
in closed form.


In random field simulation, another, maybe more important, reason is the 
necessity to enumerate the configurations, which implies coding and decoding of a 
mapping from the integers to the configuration space. The decoding part is usually 
very difficult and a small error may lead to a far-out sample (the configurations 
corresponding to close integers may be very different, which is a problem in image 
processing). 


The Monte Carlo Markov chain (MCMC) method for sampling a probability
distribution $\pi$ on the finite space $E$ partially avoids the problems just enumerated,
but at the cost of obtaining only an approximate sample.


The basic methodology is as follows. One constructs an irreducible aperiodic
HMC $\{X_n\}_{n \geq 0}$ with state space $E$ admitting $\pi$ as stationary distribution. Since $E$
is finite, the chain is ergodic and therefore, for any initial distribution $\mu$,

\[
\lim_{n \uparrow \infty} P_\mu(X_n = i) = \pi(i) \qquad (i \in E)
\]

and for any non-negative function $\varphi : E \to \mathbb{R}$,

\[
\lim_{N \uparrow \infty} \frac{1}{N} \sum_{n=1}^{N} \varphi(X_n) = E_\pi[\varphi(X)].
\]

When $n$ is “large,” we can consider that $X_n$ has a distribution “close” to $\pi$. Of
course, one would like to know how accurately $X_n$ imitates an $E$-valued random
variable $Z$ with distribution $\pi$. For this we need estimates of the form

\[
|\mu^T P^n - \pi^T| \leq A \alpha^n,
\]




where $\alpha < 1$. This issue will not be treated in this book, and only the basic
problem, that of designing the MCMC algorithm, is considered. One looks for
an ergodic transition matrix $P$ on $E$ whose stationary distribution is the target
distribution $\pi$. There are infinitely many such transition matrices, and among
them there are infinitely many that correspond to a reversible chain, that is, such
that

\[
\pi(i) p_{ij} = \pi(j) p_{ji}. \tag{9.43}
\]

We seek solutions of the form

\[
p_{ij} = q_{ij} \alpha_{ij} \tag{9.44}
\]

for $j \neq i$, where $Q = \{q_{ij}\}_{i,j \in E}$ is an arbitrary irreducible transition matrix on
$E$, called the candidate-generating matrix: When the present state is $i$, the next
tentative state $j$ is chosen with probability $q_{ij}$. When $j \neq i$, this new state is
accepted with probability $\alpha_{ij}$. Otherwise, the next state is the same state $i$. Hence,
the resulting probability of moving from $i$ to $j$ when $i \neq j$ is given by (9.44). It
remains to select the acceptance probabilities $\alpha_{ij}$.


EXAMPLE 9.6.1: THE METROPOLIS ALGORITHM. In this example, the
candidate-generation mechanism is purely random, that is, $q_{ij}$ = constant, and

\[
\alpha_{ij} = \min\left(1, \frac{\pi(j)}{\pi(i)}\right).
\]


EXAMPLE 9.6.2: BARKER'S SAMPLER. In the special case of a purely random
selection of the candidate,

\[
\alpha_{ij} = \frac{\pi(j)}{\pi(i) + \pi(j)}.
\]


In each case, the reversibility condition (9.43) is satisfied and therefore 7 is the 
stationary distribution (by Theorem 9.1.24). 
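
The reversibility claim for the two samplers can be verified exactly on a small state space. The sketch below builds the transition matrix $p_{ij} = q_{ij}\alpha_{ij}$ ($j \neq i$) for a purely random candidate, $q_{ij} = 1/r$, and checks the detailed balance equations in rational arithmetic. The four-point target and all function names are assumptions made for the example.

```python
from fractions import Fraction as F

def mcmc_matrix(pi, accept):
    """Transition matrix p_ij = q_ij * alpha_ij (j != i) with uniform candidates q_ij = 1/r."""
    r = len(pi)
    q = F(1, r)
    P = [[q * accept(pi[i], pi[j]) if j != i else F(0) for j in range(r)] for i in range(r)]
    for i in range(r):
        P[i][i] = 1 - sum(P[i])      # rejected moves, and the candidate j = i, leave the state unchanged
    return P

def metropolis(pi_i, pi_j):
    return min(F(1), pi_j / pi_i)

def barker(pi_i, pi_j):
    return pi_j / (pi_i + pi_j)

def is_reversible(pi, P):
    """Detailed balance: pi(i) p_ij = pi(j) p_ji for all i, j."""
    r = len(pi)
    return all(pi[i] * P[i][j] == pi[j] * P[j][i] for i in range(r) for j in range(r))

weights = [F(1), F(2), F(3), F(6)]               # hypothetical unnormalized target
pi = [w / sum(weights) for w in weights]
P_metropolis = mcmc_matrix(pi, metropolis)
P_barker = mcmc_matrix(pi, barker)
```

Both acceptance rules involve $\pi$ only through the ratio $\pi(j)/\pi(i)$, so calling `mcmc_matrix(weights, metropolis)` with the unnormalized weights produces the same matrix. This is why the unknown normalizing factor is not an obstacle here.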
Simulation of Random Fields 


Consider a random field that changes randomly with time. In other words, we
have a stochastic process $\{X_n\}_{n \geq 0}$ where

\[
X_n = (X_n(v), v \in V)
\]

and $X_n(v) \in \Lambda$. The state at time $n$ of this process is a random field on $V$ with
phases in $\Lambda$, or equivalently, a random variable with values in the state space
$E = \Lambda^V$, which for simplicity we assume finite. The stochastic process $\{X_n\}_{n \geq 0}$
will be called a dynamical random field.


Our purpose now is to show how a given random field with probability distribution

\[
\pi(x) = \frac{1}{Z} e^{-U(x)} \tag{9.45}
\]

can arise as the stationary distribution of a field-valued Markov chain.


The Gibbs sampler uses a strictly positive probability distribution $(q_v, v \in V)$
on $V$, and the transition from $X_n = x$ to $X_{n+1} = y$ is made according to the
following rule.


The new state $y$ is obtained from the old state $x$ by changing (or not) the value
of the phase at one site only. The site $v$ to be changed (or not) at time $n$ is chosen
independently of the past with probability $q_v$. When site $v$ has been selected, the
current configuration $x$ is changed into $y$ as follows: $y(V \backslash v) = x(V \backslash v)$, and the
new phase $y(v)$ at site $v$ is selected with probability $\pi(y(v) \mid x(V \backslash v))$. Thus,
configuration $x$ is changed into $y = (y(v), x(V \backslash v))$ with probability $\pi(y(v) \mid x(V \backslash v))$,
according to the local specification at site $v$. This gives for the non-null entries of
the transition matrix

\[
P(X_{n+1} = y \mid X_n = x) = q_v\, \pi(y(v) \mid x(V \backslash v))\, 1_{\{y(V \backslash v) = x(V \backslash v)\}}. \tag{9.46}
\]

The corresponding chain is irreducible and aperiodic if $q_v > 0$ ($v \in V$). To prove
that $\pi$ is the stationary distribution, we use the detailed balance test. For this, we
have to check that for all $x, y \in \Lambda^V$,

\[
\pi(x) P(X_{n+1} = y \mid X_n = x) = \pi(y) P(X_{n+1} = x \mid X_n = y),
\]

that is, in view of (9.46), for all $v \in V$,

\[
\pi(x)\, q_v\, \pi(y(v) \mid x(V \backslash v)) = \pi(y)\, q_v\, \pi(x(v) \mid x(V \backslash v)).
\]

But the last equality is just

\[
\pi(x)\, \frac{\pi(y(v), x(V \backslash v))}{P(X(V \backslash v) = x(V \backslash v))} = \pi(y(v), x(V \backslash v))\, \frac{\pi(x)}{P(X(V \backslash v) = x(V \backslash v))}.
\]


EXAMPLE 9.6.3: SIMULATION OF THE ISING MODEL. The local specification
at site $v$ depends only on the local configuration $x(\mathcal{N}_v)$. Note that small
neighborhoods speed up computations. Note also that the Gibbs sampler is a natural
sampler, in the sense that in a piece of ferromagnetic material, for instance, the
spins are randomly changed according to the local specification. When nature
decides to update the orientation of a dipole, it does so according to the law of
statistical mechanics. It computes the local energy for each of the two possible
spins, $E_+ = E(+1, x(\mathcal{N}_v))$ and $E_- = E(-1, x(\mathcal{N}_v))$, and takes the corresponding
orientation with a probability proportional to $e^{-E_+}$ and $e^{-E_-}$, respectively.
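
A single-site Gibbs update for the Ising model can be sketched as follows, using the local characteristic of Example 9.5.9. The grid size, the parameter values and the function names are assumptions made for the illustration; sites are indexed from 0 to $m - 1$ rather than from 1 to $m$, the product $kT$ is passed as one parameter, and the site to be updated is chosen uniformly, that is, $q_v = 1/m^2$.

```python
import math
import random

def prob_plus(x, v, J, H, kT):
    """Local characteristic at spin +1: exp(a) / (exp(a) + exp(-a)), a = (J * neighbor sum + H) / kT."""
    i, j = v
    neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    s = sum(x[w] for w in neighbors if w in x)    # sites outside the grid are absent from x
    a = (J * s + H) / kT
    return math.exp(a) / (math.exp(a) + math.exp(-a))

def gibbs_step(x, m, J, H, kT, rng):
    """Pick a site uniformly at random and resample its spin from the local specification."""
    v = (rng.randrange(m), rng.randrange(m))
    x[v] = +1 if rng.random() < prob_plus(x, v, J, H, kT) else -1

m = 8
rng = random.Random(0)
x = {(i, j): rng.choice([-1, +1]) for i in range(m) for j in range(m)}
for _ in range(20 * m * m):
    gibbs_step(x, m, J=1.0, H=0.0, kT=2.0, rng=rng)
magnetization = sum(x.values()) / (m * m)
```

Each update reads at most four neighboring spins, which is the practical meaning of the remark that small neighborhoods speed up computations.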


EXAMPLE 9.6.4: GIBBS SAMPLER FOR RANDOM VECTORS. Clearly, Gibbs
sampling applies to any multivariate probability distribution

\[
\pi(x(1), \ldots, x(N))
\]

on a set $E = \Lambda^N$, where $\Lambda$ is countable (but this restriction is not essential).


The basic step of the Gibbs sampler for the multivariate distribution $\pi$
consists in selecting a coordinate number $i$ ($1 \leq i \leq N$) at random, and then
choosing the new value $y(i)$ of the corresponding coordinate, given the present values
$x(1), \ldots, x(i-1), x(i+1), \ldots, x(N)$ of the other coordinates, with probability

\[
\pi(y(i) \mid x(1), \ldots, x(i-1), x(i+1), \ldots, x(N)).
\]

One checks as above that $\pi$ is the stationary distribution of the corresponding
chain.


The Propp—Wilson Algorithm 


We now present the basic idea of a theoretical method for obtaining an exact
sample of a given distribution $\pi$ on a finite state space $E$, that is, a random
variable $Z$ such that $P(Z = i) = \pi(i)$ for all $i \in E$. The following algorithm, the
Propp–Wilson algorithm, is based on a coupling idea. One starts from an ergodic
transition matrix $P$ with stationary distribution $\pi$, just as in the classical MCMC
method.


The algorithm is based on a representation of $P$ in terms of a recurrence
equation, that is, for a given function $f$ and an IID sequence $\{Z_n\}_{n \geq 1}$ independent of
the initial state, the chain satisfies the recurrence

\[
X_{n+1} = f(X_n, Z_{n+1}). \tag{9.47}
\]

The algorithm constructs a family of HMCs with this transition matrix with the
help of a unique IID sequence of random vectors $\{Y_n\}_{n \in \mathbb{Z}}$, called the updating
sequence, where $Y_n = (Z_{n+1}(1), \ldots, Z_{n+1}(r))$ is an $r$-dimensional random vector,
and where the coordinates $Z_{n+1}(i)$ have a common distribution, that of $Z_1$. For
each $N \in \mathbb{Z}$ and each $k \in E$, a process $\{X_n^N(k)\}_{n \geq N}$ is defined recursively by:

\[
X_N^N(k) = k,
\]

and, for $n \geq N$,

\[
X_{n+1}^N(k) = f(X_n^N(k), Z_{n+1}(X_n^N(k))).
\]

(Thus, if the chain is in state $i$ at time $n$, it will be at time $n+1$ in state
$j = f(i, Z_{n+1}(i))$.) Each of these processes is therefore an HMC with the transition
matrix $P$. Note that for all $k, \ell \in E$, and all $M, N \in \mathbb{Z}$, the HMCs $\{X_n^N(k)\}_{n \geq N}$
and $\{X_n^M(\ell)\}_{n \geq M}$ use at any time $n \geq \max(M, N)$ the same updating random
vector $Y_{n+1}$.


If, in addition to the independence of $\{Y_n\}_{n \in \mathbb{Z}}$, the components $Z_{n+1}(1)$,
$Z_{n+1}(2)$, ..., $Z_{n+1}(r)$ are, for each $n \in \mathbb{Z}$, independent, we say that the updating
is componentwise independent.


Definition 9.6.5 The random time

\[
\tau^+ = \inf\{n \geq 0;\ X_n^0(1) = X_n^0(2) = \cdots = X_n^0(r)\}
\]

is called the forward coupling time (Figure 9.2). The random time

\[
\tau^- = \inf\{n \geq 1;\ X_0^{-n}(1) = X_0^{-n}(2) = \cdots = X_0^{-n}(r)\}
\]

is called the backward coupling time (Figure 9.2).


Figure 9.2: Backward and forward coupling (states 1 to 5 on the vertical axis; backward panel over times −7, ..., 0 with $\tau^- = 7$; forward panel over times 0, ..., +4 with $\tau^+ = 4$)


Thus, $\tau^+$ is the first time at which the chains $\{X_n^0(i)\}_{n \geq 0}$, $1 \leq i \leq r$, coalesce.
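
The backward coupling time of Definition 9.6.5 can be computed by starting the chains further and further in the past while reusing the updating vectors already drawn. The following is a minimal sketch under stated assumptions: the three-state matrix `P` (with states numbered 0, 1, 2), the update function `f` built by the inverse method, and the use of independent uniform variables for the components of each updating vector are all choices made for the illustration.

```python
import random

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]      # hypothetical ergodic matrix; stationary distribution (1/4, 1/2, 1/4)

def f(i, u):
    """Update function: from state i and a uniform variable u, return j with probability P[i][j]."""
    cumulative = 0.0
    for j, p_ij in enumerate(P[i]):
        cumulative += p_ij
        if u < cumulative:
            return j
    return max(j for j, p_ij in enumerate(P[i]) if p_ij > 0)   # round-off guard

def backward_coupling(rng):
    """Return (n, z): the backward coupling time and the common value at time 0."""
    updates = []   # updates[m][i] drives the move out of state i from time -(m + 1) to time -m
    while True:
        # One more updating vector, further in the past, with independent components.
        updates.append([rng.random() for _ in P])
        finals = set()
        for k in range(len(P)):                   # chain started from k at time -len(updates)
            x = k
            for m in reversed(range(len(updates))):
                x = f(x, updates[m][x])
            finals.add(x)
        if len(finals) == 1:
            return len(updates), finals.pop()

rng = random.Random(2024)
results = [backward_coupling(rng) for _ in range(5000)]
draws = [z for _, z in results]
frequencies = [draws.count(j) / len(draws) for j in range(len(P))]
```

The essential design point is that `updates` is only ever extended, never redrawn: the transitions between two given times are the same for every starting time, as required by the construction of the processes $\{X_n^N(k)\}_{n \geq N}$ from a single updating sequence.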
Lemma 9.6.6 When the updating is componentwise independent, the forward
coupling time $\tau^+$ is almost surely finite.


Proof. Consider the (immediate) extension of Theorem 9.3.8 to the case of r 
independent HMCs with the same transition matrix. It cannot be applied directly 
to our situation, because the chains are not independent. However, the probability 
of coalescence in our situation is bounded below by the probability of coalescence 
in the completely independent case. To see this, first construct the independent 
chains model, using r independent IID componentwise independent updating se- 
quences. The difference with our model is that we use too many updates. In order 
to construct from this a set of r chains as in our model, it suffices to use for two 
chains the same updates as soon as they meet. Clearly, the forward coupling time 
of the so modified model is smaller than or equal to that of the initial completely 
independent model. 


For a simpler notation, let $\tau := \tau^-$. Let
$$Z = X_0^{-\tau}(i).$$

(This random variable is independent of $i$. In Figure 9.2, $Z = 2$.) Then,

Theorem 9.6.7 With a componentwise independent updating sequence, the backward
coupling time $\tau$ is almost surely finite. Also, the random variable $Z$ has the
distribution $\pi$.

Proof. We shall show at the end of the current proof that for all $k \in \mathbb{N}$,
$P(\tau \le k) = P(\tau^+ \le k)$, and therefore the finiteness of $\tau$ follows from that of
$\tau^+$ proven in the last lemma. Now, since for $n \ge \tau$, $X_0^{-n}(i) = Z$,

$$\begin{aligned}
P(Z = j) &= P(Z = j, \tau > n) + P(Z = j, \tau \le n) \\
&= P(Z = j, \tau > n) + P(X_0^{-n}(i) = j, \tau \le n) \\
&= P(Z = j, \tau > n) - P(X_0^{-n}(i) = j, \tau > n) + P(X_0^{-n}(i) = j) \\
&= P(Z = j, \tau > n) - P(X_0^{-n}(i) = j, \tau > n) + p_{ij}(n) \\
&= A_n - B_n + p_{ij}(n).
\end{aligned}$$

But $A_n$ and $B_n$ are bounded above by $P(\tau > n)$, a quantity that tends to 0 as
$n \uparrow \infty$ since $\tau$ is almost surely finite. Therefore

$$P(Z = j) = \lim_{n \uparrow \infty} p_{ij}(n) = \pi(j).$$


It remains to prove the equality of the distributions of the forwards and backwards
coupling times. For this, select an arbitrary integer $k \in \mathbb{N}$. Consider an updating
sequence constructed from a bona fide updating sequence $\{Y_n\}_{n \in \mathbb{Z}}$ by replacing
$Y_{-k+1}, Y_{-k+2}, \ldots, Y_0$ by $Y_1, Y_2, \ldots, Y_k$. Call $\tau'$ the backwards coupling time in the
modified model. Clearly $\tau$ and $\tau'$ have the same distribution.

Figure 9.3: $\tau^+ \le k$ implies $\tau' \le k$

Suppose that $\tau^+ \le k$. Consider in the modified model the chains starting at
time $-k$ from states $1, \ldots, r$. They coalesce at time $-k + \tau^+ \le 0$ (see Figure 9.3),
and consequently $\tau' \le k$. Therefore $\tau^+ \le k$ implies $\tau' \le k$, so that

$$P(\tau^+ \le k) \le P(\tau' \le k) = P(\tau \le k).$$

Figure 9.4: $\tau' \le k$ implies $\tau^+ \le k$


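Now, suppose that $\tau' \le k$. Then, in the modified model, the chains starting
at time $-\tau'$ from states $1, \ldots, r$ have coalesced at time 0. Since the updates used
by the modified model at the times $-k+1, \ldots, 0$ are $Y_1, \ldots, Y_k$, this means that, in the
original model, the chains starting at time $k - \tau'$ from states $1, \ldots, r$ have coalesced
at time $k$, and therefore so have the chains starting at time 0 (Figure 9.4); that is,
$\tau^+ \le k$. Therefore $\tau' \le k$ implies $\tau^+ \le k$, so that

$$P(\tau \le k) = P(\tau' \le k) \le P(\tau^+ \le k).$$

Note that the coalesced value at the forward coupling time is not a sample of
$\pi$ (see Exercise 9.7.21).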


The above exact sampling algorithm is often prohibitively time-consuming 
when the state space is large. However, if the algorithm required the coalescence 
of two, instead of r processes, then it would take less time. The Propp and Wilson 
algorithm does this in a special, yet not rare, case, which we now describe. 


It is now assumed that there exists a partial order relation on $E$, denoted by
$\preceq$, with a minimal and a maximal element (say, respectively, 1 and $r$), and that
we can perform the updating in such a way that for all $i, j \in E$, all $N \in \mathbb{Z}$, and
all $n \ge N$,

$$i \preceq j \Rightarrow X_n^N(i) \preceq X_n^N(j).$$

However, we do not require componentwise independent updating (but the sequence of
updating vectors remains IID). The corresponding sampling procedure is called
the monotone Propp-Wilson algorithm.

Define the backwards monotone coupling time

$$\tau_m = \inf\{n \ge 1;\ X_0^{-n}(1) = X_0^{-n}(r)\}.$$


Figure 9.5: The monotone Propp-Wilson algorithm


Theorem 9.6.8 The monotone backwards coupling time $\tau_m$ is almost surely finite.
Also, the random variable $X_0^{-\tau_m}(1) = X_0^{-\tau_m}(r)$ has the distribution $\pi$.

Proof. We can use most of the proof of Theorem 9.6.7. We need only to prove
independently that $\tau^+$ is finite. It is so because $\tau^+$ is dominated by the first time
$n \ge 0$ such that $X_n^0(r) = 1$, and the latter is finite in view of the recurrence
assumption.


Monotone coupling will occur with representations of the form (9.47) such that
for all $z$,
$$i \preceq j \Rightarrow f(i, z) \preceq f(j, z),$$
and if for all $n \in \mathbb{Z}$, all $i \in \{1, \ldots, r\}$,

$$Z_{n+1}(i) = Z_{n+1}.$$


EXAMPLE 9.6.9: A DAM MODEL. We consider the following model of a dam
reservoir. The corresponding HMC, with values in $E = \{0, 1, \ldots, r\}$, satisfies the
recurrence equation

$$X_{n+1} = \min\{r, \max(0, X_n + Z_{n+1})\},$$

where, as usual, $\{Z_n\}_{n \ge 1}$ is IID. In this specific model, $X_n$ is the content at time
$n$ of a dam reservoir with maximum capacity $r$, and $Z_{n+1} = A_{n+1} - c$, where $A_{n+1}$
is the input into the reservoir during the time period from $n$ to $n+1$, and $c$ is the
maximum release during the same period. The updating rule is then monotone.
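The monotone Propp-Wilson procedure applied to the dam recurrence can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the names `dam_update`, `monotone_cftp`, `draw_z` and `t0` are hypothetical, the increments are assumed to be integer-valued, and the starting times are doubled at each attempt. The essential point is that the increments already drawn for the times closest to 0 are reused when the starting time is pushed further into the past.

```python
import random

def dam_update(x, z, r):
    """One step of the dam recurrence X_{n+1} = min(r, max(0, X_n + Z_{n+1}))."""
    return min(r, max(0, x + z))

def monotone_cftp(r, draw_z, rng, t0=1):
    """Sketch of the monotone Propp-Wilson sampler for the dam chain on {0, ..., r}.

    draw_z(rng) is a user-supplied (hypothetical) sampler returning one increment Z.
    Starting times are -t0, -2*t0, -4*t0, ...
    """
    increments = []  # increments[m] drives the step from time -(m+1) to time -m
    t = t0
    while True:
        # Draw only the new increments, those that lie further in the past.
        while len(increments) < t:
            increments.append(draw_z(rng))
        low, high = 0, r  # minimal and maximal states, both started at time -t
        for m in range(t - 1, -1, -1):
            z = increments[m]  # the same update is applied to both chains
            low, high = dam_update(low, z, r), dam_update(high, z, r)
        if low == high:  # the two extreme chains have coalesced at time 0
            return low
        t *= 2

# Usage: increments +1 or -1 with probability 1/2 each, capacity r = 3.
rng = random.Random(2024)
sample = monotone_cftp(3, lambda g: g.choice((-1, 1)), rng)
```

Only the two chains started from the minimal and the maximal states are simulated, which is what makes the monotone version cheaper than running one chain per state.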


In practical implementations, instead of trying the times $-1$, $-2$, etc., one
may use successive starting times of the form $-\alpha^k T_0$. Let $k$ be the first integer for which
$\alpha^k T_0 \ge \tau_-$. The number of simulation steps used is $2(T_0 + \alpha T_0 + \cdots + \alpha^k T_0)$ (the
factor 2 accounts for the fact that we are running two chains), that is,

$$2 T_0 \left( \frac{\alpha^{k+1} - 1}{\alpha - 1} \right) < 2 T_0 \alpha^{k-1} \left( \frac{\alpha^2}{\alpha - 1} \right) < 2 \tau_- \frac{\alpha^2}{\alpha - 1}$$

steps, where we have assumed that $T_0 < \tau_-$. In the best case, supposing we are
informed of the exact value of $\tau_-$ by some oracle, the number of steps is $2\tau_-$. The
ratio of the worst to best cases is $\frac{\alpha^2}{\alpha - 1}$, which is minimized for $\alpha = 2$. This is why
it is usually suggested to start the successive attempts of backward coalescence at
times of the form $-2^k T_0$ ($k \ge 0$).
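The step count above can be checked numerically. The following is a small sketch, with hypothetical helper names `total_steps` and `worst_case_ratio`, that counts the steps used by the successive attempts and compares them with the bound $2\tau_- \alpha^2/(\alpha - 1)$; it assumes integer $T_0$ and $\alpha$ for simplicity.

```python
def total_steps(tau, t0=1, alpha=2):
    """Steps used by attempts started at -t0, -alpha*t0, ... until alpha**k * t0 >= tau."""
    steps, t = 0, t0
    while True:
        steps += 2 * t  # two chains are run from time -t to time 0
        if t >= tau:    # this attempt reaches back far enough to coalesce
            return steps
        t *= alpha

def worst_case_ratio(alpha):
    """Ratio alpha^2 / (alpha - 1) of the worst-case bound to the best case 2*tau."""
    return alpha ** 2 / (alpha - 1)

# Usage: with tau = 5 and doubling from T_0 = 1, the attempts start at -1, -2, -4, -8.
steps = total_steps(5)               # 2 * (1 + 2 + 4 + 8) = 30
bound = 2 * 5 * worst_case_ratio(2)  # 40.0
```

The choice $\alpha = 2$ is where the derivative of $\alpha^2/(\alpha - 1)$, namely $(\alpha^2 - 2\alpha)/(\alpha - 1)^2$, vanishes on $\alpha > 1$.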


9.7 Exercises 


Exercise 9.7.1. A COUNTEREXAMPLE 
Find a simple example of an HMC $\{X_n\}_{n \ge 0}$ with state space $E = \{1, 2, 3, 4, 5, 6\}$
such that

$$P(X_2 = 6 \mid X_1 \in \{3, 4\}, X_0 = 2) \ne P(X_2 = 6 \mid X_1 \in \{3, 4\}).$$




Exercise 9.7.2. PAST, PRESENT, FUTURE 
For an HMC $\{X_n\}_{n \ge 0}$ with state space $E$, prove that for all $n \in \mathbb{N}$, and all states
$i_0, i_1, \ldots, i_{n-1}, i, j_1, j_2, \ldots, j_k \in E$,

$$P(X_{n+1} = j_1, \ldots, X_{n+k} = j_k \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0)
= P(X_{n+1} = j_1, \ldots, X_{n+k} = j_k \mid X_n = i).$$


Exercise 9.7.3. GIVEN ADJACENT STATES 

Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$ and transition matrix $P$. Show that
for all $n \ge 1$ and all $k \ge 2$, $X_n$ is conditionally independent of $X_0, \ldots, X_{n-2}$,
$X_{n+2}, \ldots, X_{n+k}$ given $X_{n-1}, X_{n+1}$. Compute the conditional distribution of $X_n$
given $X_{n-1}, X_{n+1}$.


Exercise 9.7.4. STREET GANGS 

Three characters, A, B, and C, armed with guns, suddenly meet at the corner of a 
Washington D.C. street, whereupon they naturally start shooting at one another. 
Each street gang kid shoots every tenth second, as long as he is still alive. The 
probabilities of a hit for A, B, and C are $\alpha$, $\beta$, and $\gamma$ respectively. A is the most
hated, and therefore, as long as he is alive, B and C ignore each other and shoot 
at A. For historical reasons not developed here, A cannot stand B, and therefore 
he shoots only at B while the latter is still alive. Lucky C is shot at if and only if 
he is in the presence of A alone or B alone. What are the survival probabilities of 
A, B, and C, respectively? 


Exercise 9.7.5. THE GAMBLER’S RUIN 
(This exercise continues Example 9.1.10.) Compute the average duration of the
game when $p = \frac{1}{2}$.


Exercise 9.7.6. RECORDS 

Let $\{Z_n\}_{n \ge 1}$ be an IID sequence of geometric random variables: For $k \ge 0$,
$P(Z_n = k) = (1 - p)^k p$, where $p \in (0, 1)$. Let $X_n = \max(Z_1, \ldots, Z_n)$ be the
record value at time $n$, and suppose $X_0$ is an integer-valued random variable
independent of the sequence $\{Z_n\}_{n \ge 1}$. Show that $\{X_n\}_{n \ge 0}$ is an HMC and give its
transition matrix.


Exercise 9.7.7. AGGREGATION OF STATES 

Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$ and transition matrix $P$, and let
$(A_k, k \ge 1)$ be a countable partition of $E$. Define the process $\{\hat{X}_n\}_{n \ge 0}$ with state
space $\hat{E} = \{1, 2, \ldots\}$ by $\hat{X}_n = k$ if and only if $X_n \in A_k$. Show that if $\sum_{j \in A_\ell} p_{ij}$
is independent of $i \in A_k$ for all $k, \ell$, then $\{\hat{X}_n\}_{n \ge 0}$ is an HMC with transition
probabilities $\hat{p}_{k\ell} = \sum_{j \in A_\ell} p_{ij}$ (any $i \in A_k$).


Exercise 9.7.8. TRUNCATED HMC 

Let $P$ be a transition matrix on the countable state space $E$, with the positive
stationary distribution $\pi$. Let $A$ be a subset of the state space, and define the
truncation of $P$ on $A$ to be the transition matrix $Q$ indexed by $A$ and given by

$$q_{ij} = p_{ij} \ \text{ if } i, j \in A,\ i \ne j, \qquad \text{and} \qquad q_{ii} = p_{ii} + \sum_{k \in \bar{A}} p_{ik}.$$

Show that if $(P, \pi)$ is reversible, then so is $(Q, \pi / \pi(A))$.


Exercise 9.7.9. MOVING STONES 

Stones $S_1, \ldots, S_M$ are placed in line. At each time $n$ a stone is selected at random,
and this stone and the one ahead of it in the line exchange positions. If the
selected stone is at the head of the line, nothing is changed. For instance, with
$M = 5$: Let the current configuration be $S_2 S_3 S_1 S_5 S_4$ ($S_2$ is at the head of the
line). If $S_5$ is selected, the new situation is $S_2 S_3 S_5 S_1 S_4$, whereas if $S_2$ is selected,
the configuration is not altered. At each step, stone $S_i$ is selected with probability
$\alpha_i > 0$. Call $X_n$ the situation at time $n$, for instance $X_n = S_{i_1} \cdots S_{i_M}$, meaning
that stone $S_{i_j}$ is in the $j$th position. Show that $\{X_n\}_{n \ge 0}$ is an irreducible HMC
and that it has a stationary distribution given by the formula

$$\pi(S_{i_1} \cdots S_{i_M}) = C \alpha_{i_1}^{M} \alpha_{i_2}^{M-1} \cdots \alpha_{i_M},$$

for some normalizing constant $C$.


Exercise 9.7.10. NO STATIONARY DISTRIBUTION 
Show that the symmetric random walk on Z cannot have a stationary distribution. 


Exercise 9.7.11. AN INTERPRETATION OF INVARIANT MEASURE 

A countable number of particles move independently in the countable space $E$,
each according to a Markov chain with the transition matrix $P$. Let $A_n(i)$ be the
number of particles in state $i \in E$ at time $n \ge 0$, and suppose that the random
variables $A_0(i)$, $i \in E$, are independent Poisson random variables with respective
means $\mu(i)$, $i \in E$, where $\mu = \{\mu(i)\}_{i \in E}$ is an invariant measure of $P$. Show that
for all $n \ge 1$, the random variables $A_n(i)$, $i \in E$, are independent Poisson random
variables with respective means $\mu(i)$, $i \in E$.


Exercise 9.7.12. RETURN TIME TO THE INITIAL STATE
Let $\tau$ be the first return time to the initial state of an irreducible positive recurrent
HMC $\{X_n\}_{n \ge 0}$; that is,

$$\tau = \inf\{n \ge 1;\ X_n = X_0\},$$

with $\tau = +\infty$ if $X_n \ne X_0$ for all $n \ge 1$. Compute the expectation of $\tau$ when the
initial distribution is the stationary distribution $\pi$. Conclude that it is finite if and
only if $E$ is finite.


Exercise 9.7.13. THE SNAKE CHAIN 

Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$ and transition matrix $P$. Let, for
$L \ge 1$, $Y_n = (X_n, X_{n+1}, \ldots, X_{n+L})$.

(a) The process $\{Y_n\}_{n \ge 0}$ takes its values in $F = E^{L+1}$. Prove that $\{Y_n\}_{n \ge 0}$ is an
HMC and give the general entry of its transition matrix. (The chain $\{Y_n\}_{n \ge 0}$ is
called the snake chain of length $L + 1$ associated with $\{X_n\}_{n \ge 0}$.)

(b) Show that if $\{X_n\}_{n \ge 0}$ is irreducible, then so is $\{Y_n\}_{n \ge 0}$ if we restrict the state
space of the latter to be $F = \{(i_0, \ldots, i_L) \in E^{L+1};\ p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{L-1} i_L} > 0\}$. Show
that if the original chain is irreducible aperiodic, so is the snake chain.

(c) Show that if $\{X_n\}_{n \ge 0}$ has a stationary distribution $\pi$, then $\{Y_n\}_{n \ge 0}$ also has a
stationary distribution. Which one?


Exercise 9.7.14. PRODUCT MARKOV CHAIN 

Let $\{X_n^{(1)}\}_{n \ge 0}$ and $\{X_n^{(2)}\}_{n \ge 0}$ be two independent irreducible and aperiodic HMCs
with the same transition matrix $P$. Define the product HMC $\{Z_n\}_{n \ge 0}$ taking its
values in $E \times E$ by $Z_n = (X_n^{(1)}, X_n^{(2)})$. Prove that it is indeed an HMC. What is its
$n$-step transition matrix? Prove that it is irreducible. Give a counterexample if
the hypothesis of aperiodicity is omitted.


Exercise 9.7.15. PROOF OF LEMMA 9.2.18 
Prove Lemma 9.2.18. 


Exercise 9.7.16. IID RANDOM FIELDS 
A. Let $(Z(v), v \in V)$ be a family of IID random variables indexed by a finite set
$V$, with $P(Z(v) = -1) = p$, $P(Z(v) = +1) = q = 1 - p$. Show that

$$P(Z = z) = K e^{\gamma \sum_{v \in V} z(v)}$$
for some constants $\gamma$ and $K$ to be identified.

B. Do the same with $P(Z(v) = 0) = p$, $P(Z(v) = +1) = q = 1 - p$.


Exercise 9.7.17. TWO-STATE HMC AS GIBBS FIELD
Consider an HMC $\{X_n\}_{n \ge 0}$ with state space $E = \{-1, 1\}$ and transition matrix

$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$

where $\alpha, \beta \in (0, 1)$, and with the stationary initial distribution

$$(\nu_0, \nu_1) = \frac{1}{\alpha + \beta} (\beta, \alpha).$$

Give a representation of $Z := (X_0, \ldots, X_N)$ as a Markov random field, that is,
give its local characteristics.


Exercise 9.7.18. MARKOV CHAIN AS MARKOV FIELD 
Let $\{X_n\}_{n \ge 0}$ be an HMC. Prove that for all $n \ge 1$, $X_n$ is independent of
$(X_k, k \notin \{n-1, n, n+1\})$ given $(X_{n-1}, X_{n+1})$.


Exercise 9.7.19. ISING ON THE TORUS 

Consider the classical Ising model of Example 9.5.7, except that the site space 
V = {1,2,...,N} consists of N points arranged in this order on a circle. The 
neighbors of site i are i+1 and i—1, with the convention that site N +1 is site 1. 
The phase space is A = {+1, —1}. Compute the partition function. Hint: express 
the normalizing constant Zy in terms of the N-th power of the matrix 


w= (R22 ACL) = (Co Se): 


Exercise 9.7.20. MONOTONICITY OF THE GIBBS SAMPLER 

Let $\mu$ be an arbitrary probability measure on $\Lambda^V$ and let $\nu$ be the probability
measure obtained by applying the Gibbs sampler at an arbitrary site $v \in V$. Show
that $d_V(\nu, \pi) \le d_V(\mu, \pi)$.


Exercise 9.7.21. A COUNTEREXAMPLE 

Let $Z^+$ be the common value of the coalesced chains at the forwards coupling time
$\tau^+$ for the usual two-state ergodic HMC. Is the distribution of $Z^+$ the stationary
distribution?


Exercise 9.7.22. APERIODICITY 
a. Show that an irreducible transition matrix $P$ with at least one state $i \in E$ such
that $p_{ii} > 0$ is aperiodic.

b. Let $P$ be an irreducible transition matrix on the finite state space $E$. Show
that a necessary and sufficient condition for $P$ to be aperiodic is the existence of
an integer $m$ such that $P^m$ has all its entries positive.

c. Consider an HMC that is irreducible with period $d \ge 2$. Show that the restriction
of the transition matrix to any cyclic class is irreducible. Show that the restriction
of $P^d$ to any cyclic class is aperiodic.


Exercise 9.7.23. DOUBLY STOCHASTIC TRANSITION MATRIX 

A stochastic matrix $P$ on the state space $E$ is called doubly stochastic if for all
states $i$, $\sum_{j \in E} p_{ji} = 1$. Suppose in addition that $P$ is irreducible, and that $E$
is infinite. Find the invariant measure of $P$. Show that $P$ cannot be positive
recurrent.


Exercise 9.7.24. RETURNS TO A GIVEN SET 

Let $\{X_n\}_{n \ge 0}$ be an HMC on the state space $E$ with transition matrix $P$. Let $\{\tau_n\}_{n \ge 1}$
be the successive return times to a given subset $F \subset E$. Assume these times are
almost surely finite. Let $X_0 = 0 \in F$, and define $Y_n = X(\tau_n)$. Show that $\{Y_n\}_{n \ge 0}$
is an HMC with state space $F$.


Exercise 9.7.25. NULL RECURRENCE OF THE 2-D SYMMETRIC RANDOM WALK 
Show that the 2-D symmetric random walk on $\mathbb{Z}^2$ is null recurrent.


Exercise 9.7.26. TRANSIENCE OF THE 4-D SYMMETRIC RANDOM WALK 
Show that the projection of the 4-D symmetric random walk on $\mathbb{Z}^3$ is a lazy
symmetric random walk on $\mathbb{Z}^3$. Deduce from this that the 4-D symmetric random
walk is transient. More generally, show that the symmetric random walk on $\mathbb{Z}^p$,
$p \ge 5$, is transient.


Exercise 9.7.27. COUPLING TIME 

Let $P$ be an ergodic transition matrix on the finite state space $E$. Prove that for
any initial distributions $\mu$ and $\nu$, one can construct two HMCs $\{X_n\}_{n \ge 0}$ and $\{Y_n\}_{n \ge 0}$
on $E$ with the same transition matrix $P$, and the respective initial distributions $\mu$
and $\nu$, in such a way that they couple at a finite time $\tau$ such that $E[e^{\alpha \tau}] < \infty$ for
some $\alpha > 0$.


Exercise 9.7.28. THE LAZY RANDOM WALK ON THE CIRCLE 

The state space $E := \{0, 1, \ldots, N - 1\}$ consists of a succession of $N$ equidistant
points on a circle in such a way that two points $i$ and $j$ such that $j = i \pm 1$ modulo
$N$ are neighbors. Consider the Markov chain $\{(X_n, Y_n)\}_{n \ge 0}$ with state space $E \times E$
and representing two particles moving on $E$ as follows. At each time $n$ choose $X_n$
or $Y_n$ with probability $\frac{1}{2}$ and move the corresponding particle to the left or to the
right, equiprobably, while the other particle remains still. The initial positions of
the particles are $a$ and $b$ respectively. Compute the average time it takes until the
two particles collide (the average coupling time of two lazy random walks).


Exercise 9.7.29. COUPLING TIME FOR THE 2-STATE HMC 
Find the distribution of the first meeting time of two independent HMCs with state
space $E = \{1, 2\}$ and transition matrix

$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$
where $\alpha, \beta \in (0, 1)$, when their initial states are different.


Exercise 9.7.30. EXTENSION TO NEGATIVE TIMES 

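Let $\{X_n\}_{n \ge 0}$ be an HMC with state space $E$, transition matrix $P$, and suppose
that there exists a stationary distribution $\pi > 0$. Suppose moreover that the
initial distribution is $\pi$. Define the matrix $Q = \{q_{ij}\}_{i,j \in E}$ by (9.7). Construct
$\{X_{-n}\}_{n \ge 1}$, independent of $\{X_n\}_{n \ge 1}$ given $X_0$, as follows:

$$\begin{aligned}
&P(X_{-1} = i_1, X_{-2} = i_2, \ldots, X_{-k} = i_k \mid X_0 = i, X_1 = j_1, \ldots, X_n = j_n) \\
&\quad = P(X_{-1} = i_1, X_{-2} = i_2, \ldots, X_{-k} = i_k \mid X_0 = i) = q_{i i_1} q_{i_1 i_2} \cdots q_{i_{k-1} i_k}
\end{aligned}$$

for all $k \ge 1$, $n \ge 1$, $i, i_1, \ldots, i_k, j_1, \ldots, j_n \in E$. Prove that $\{X_n\}_{n \in \mathbb{Z}}$ is an HMC
with transition matrix $P$ and $P(X_n = i) = \pi(i)$, for all $i \in E$, all $n \in \mathbb{Z}$.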




Chapter 10 


Poisson Processes 


Poisson processes are particular types of random point processes. A random point
process on the line (resp. in space) is, roughly speaking, a countable random set
of points of the real line (resp. in some space¹).


In most applications to engineering and operations research, a point of a point 
process on the line is the time of occurrence of some event, and this is why points 
are also called events. For instance, the arrival times of customers at the desk of a 
post office or of jobs at the central processing unit of a computer are point process 
events. In biology the time of birth of an organism and in physiology the firing 
time of a neuron are events. In applications to ecology, a point of a spatial point 
process could be the location of a tree in a forest, or of a source of pollution. In a 
communications context, it may represent the position of a cellphone or of a relay 
antenna. 


10.1 Poisson Processes on the Line 


This section introduces the homogeneous Poisson process, the simplest example of 
a random point process on the line. 


Definition 10.1.1 A random point process on the line is a sequence $\{T_n\}_{n \in \mathbb{Z}}$ of
real random variables such that, almost surely,

(i) $\cdots < T_{-1} < T_0 \le 0 < T_1 < T_2 < \cdots$, and

(ii) $\lim_{|n| \uparrow \infty} |T_n| = +\infty$.


The usual definition of a random point process is less restrictive. In particular,
condition (i) is relaxed in the more general definition, where multiple points
(simultaneous arrivals to a ticket booth, for instance) are allowed. When condition
(i) holds, one speaks of a simple point process. Also, condition (ii) is not required
in the more general definition which allows with positive probability an explosion,
that is, an accumulation of events in finite time. However, conditions (i) and (ii)
fit the special case of homogeneous Poisson processes, the center of interest in this
section.

¹ In this book, $\mathbb{R}^m$ for $m \ge 2$.


The sequence $\{T_n - T_{n-1}\}_{n \in \mathbb{Z}}$ is called the inter-event sequence or, in the
appropriate context, the inter-arrival sequence. For any interval $(a, b]$ in $\mathbb{R}$,

$$N((a, b]) = \sum_{n \in \mathbb{Z}} 1_{(a, b]}(T_n)$$

is an integer-valued random variable counting the events occurring in the time
interval $(a, b]$. For typographical simplicity, it will be occasionally denoted by
$N(a, b]$, omitting the external parentheses. If $t > 0$, we sometimes let $N(t) := N(0, t]$.


Since the interval $(a, t]$ ($t > 0$) is closed on the right, the trajectories (or sample
paths) $t \mapsto N((a, t], \omega)$ are right-continuous. They are non-decreasing, have limits
on the left at every time $t$ and jump one unit upwards at each event of the point
process.

The family of random variables $N := \{N(a, b]\}_{(a, b] \subset \mathbb{R}}$ is called the counting
process of the point process $\{T_n\}_{n \in \mathbb{Z}}$. Since the sequence of events can be recovered
from $N$, the latter also receives the appellation "point process."


The Counting Process of an HPP 


There exist several equivalent definitions of a Poisson process. The one adopted 
here is the most practical. 


Definition 10.1.2 A point process $N$ on the positive half-line is called a homogeneous
Poisson process (HPP) with intensity $\lambda > 0$ if

($\alpha$) for all times $t_i \in \mathbb{R}_+$ ($1 \le i \le k$) such that $t_1 < t_2 < \cdots < t_k$, the random
variables $N(t_i, t_{i+1}]$ ($1 \le i \le k - 1$) are independent, and

($\beta$) for any interval $(a, b] \subset \mathbb{R}_+$, $N(a, b]$ is a Poisson random variable with mean
$\lambda(b - a)$.

In particular,
$$P(N(a, b] = k) = e^{-\lambda(b - a)} \frac{[\lambda(b - a)]^k}{k!} \quad (k \ge 0)$$
and
$$E[N(a, b]] = \lambda(b - a).$$
In this sense, $\lambda$ is the average density of points.


Condition ($\alpha$) is the property of independence of increments of Poisson processes.
It implies in particular that for any interval $(a, b]$, the random variable
$N(a, b]$ is independent of $(N(c, d],\ c < d \le a)$. For this reason, Poisson processes
are sometimes called memoryless. A more precise statement is "the increments of
homogeneous Poisson processes have no memory of the past".


The definition adopted for random point processes does not allow for multiple
points or explosions. But suppose it did. It turns out that requirements ($\alpha$) and
($\beta$) in Definition 10.1.2 suffice to prevent such occurrences.

A proof is as follows. Since $E[N(a)] = \lambda a < \infty$, $N(a) < \infty$ almost surely.
Since this is true for all $a > 0$, $\lim_{n \uparrow \infty} T_n = \infty$ almost surely. Simplicity will follow
from $P(D(a)) = 0$ for all $a > 0$, where

$$D(a) := \{\text{there exist multiple points in } (0, a]\}.$$
We prove this for $D = D(1)$ (without loss of generality). The event

$$D_n := \left\{ N\left( \frac{i}{2^n}, \frac{i+1}{2^n} \right] \ge 2 \text{ for some } i \ (0 \le i \le 2^n - 1) \right\}$$

decreases to $D$ as $n$ tends to infinity and therefore, by the monotone sequential
continuity of probability,

$$P(D) = \lim_{n \uparrow \infty} P(D_n) = 1 - \lim_{n \uparrow \infty} P(\overline{D}_n).$$

But, by the independence of increments,

$$P(\overline{D}_n) = \prod_{i=0}^{2^n - 1} P\left( N\left( \frac{i}{2^n}, \frac{i+1}{2^n} \right] \le 1 \right)
= \prod_{i=0}^{2^n - 1} e^{-\lambda 2^{-n}} (1 + \lambda 2^{-n}) = e^{-\lambda} (1 + \lambda 2^{-n})^{2^n}.$$

The limit of the latter quantity is 1 as $n \uparrow \infty$, and therefore, $P(D) = 0$.


Let
$$S_1 := T_1 \quad \text{and} \quad S_n := T_n - T_{n-1} \ (n \ge 2).$$

Theorem 10.1.3 The sequence $\{S_n\}_{n \ge 1}$ of an HPP with intensity $\lambda > 0$ is IID
with a common exponential distribution of parameter $\lambda$.

The cumulative distribution function of an arbitrary inter-event time is therefore
$$P(S_n \le t) = 1 - e^{-\lambda t}.$$
Recall that
$$E[S_n] = \frac{1}{\lambda},$$
that is, the average number of events per unit of time equals the inverse average
inter-event time.
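Theorem 10.1.3 suggests a direct way of simulating an HPP: add up independent exponential inter-event times. The sketch below, in which the function name `hpp_event_times` and the parameter `horizon` are illustrative choices, returns the event times in $(0, \text{horizon}]$; it relies on `random.expovariate`, whose argument is the rate $\lambda$ (the inverse of the mean).

```python
import random

def hpp_event_times(lam, horizon, rng):
    """Event times in (0, horizon] of an HPP with intensity lam,
    built from IID exponential inter-event times S_1, S_2, ..."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)  # S_n is exponential with parameter lam
        if t > horizon:
            return times
        times.append(t)

# Usage: the number of events in (0, 3] should average lam * 3 = 6.
rng = random.Random(1)
counts = [len(hpp_event_times(2.0, 3.0, rng)) for _ in range(5000)]
empirical_mean = sum(counts) / len(counts)
```

Over many runs the empirical mean and variance of the count should both be close to $\lambda t$, as expected for a Poisson variable.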


Proof. Suppose we can show that for any $n \ge 1$, the random vector $T :=
(T_1, \ldots, T_n)$ admits the probability density function

$$f_T(t_1, \ldots, t_n) = \lambda^n e^{-\lambda t_n} 1_C(t_1, \ldots, t_n), \tag{10.1}$$
where $C := \{(t_1, \ldots, t_n);\ 0 < t_1 < \cdots < t_n\}$. Since
$$S_1 = T_1,\ S_2 = T_2 - T_1,\ \ldots,\ S_n = T_n - T_{n-1},$$

the formula of smooth change of variables gives for the probability density function
of $S = (S_1, \ldots, S_n)$

$$f_S(s_1, \ldots, s_n) = f_T(s_1, s_1 + s_2, \ldots, s_1 + \cdots + s_n) = \prod_{i=1}^{n} \left\{ \lambda e^{-\lambda s_i} 1_{\{s_i > 0\}} \right\}.$$

It remains to prove (10.1).


The proof that we now give is somewhat heuristic (and we let the reader discover
why) but most convincing.

The probability density function of $T$ at $t = (t_1, \ldots, t_n)$ is obtained as the limit
as $h_1, \ldots, h_n \in \mathbb{R}_+$ tend to 0 of the quantity

$$\frac{P\left( \cap_{i=1}^{n} \{T_i \in (t_i, t_i + h_i]\} \right)}{\prod_{i=1}^{n} h_i}, \tag{10.2}$$

where it suffices to consider those $(t_1, \ldots, t_n)$ inside $C$ since the points $T_1, \ldots, T_n$
are strictly ordered in increasing order. For sufficiently small $h_1, \ldots, h_n$, the
event $\cap_{i=1}^{n} \{T_i \in (t_i, t_i + h_i]\}$ is the intersection of the events $\{N(0, t_1] = 0\}$,
$\cap_{i=1}^{n-1} \{N(t_i, t_i + h_i] = 1, N(t_i + h_i, t_{i+1}] = 0\}$ and $\{N(t_n, t_n + h_n] \ge 1\}$, and therefore
the numerator of (10.2) equals

$$\begin{aligned}
&P(N(0, t_1] = 0) \left( \prod_{i=1}^{n-1} P(N(t_i, t_i + h_i] = 1, N(t_i + h_i, t_{i+1}] = 0) \right) P(N(t_n, t_n + h_n] \ge 1) \\
&\quad = e^{-\lambda t_1} \prod_{i=1}^{n-1} \left( e^{-\lambda h_i} \lambda h_i e^{-\lambda (t_{i+1} - t_i - h_i)} \right) (1 - e^{-\lambda h_n})
= \lambda^{n-1} e^{-\lambda t_n} h_1 \cdots h_{n-1} (1 - e^{-\lambda h_n}).
\end{aligned}$$

Dividing by $h_1 \cdots h_n$ and taking the limit as $h_1, \ldots, h_n$ tend to 0, we obtain
$\lambda^n e^{-\lambda t_n}$.

Competing Poisson Processes 


Let $\{T_n^1\}_{n \ge 1}$ and $\{T_n^2\}_{n \ge 1}$ be two independent HPPs on $\mathbb{R}_+$ with respective
intensities $\lambda_1 > 0$ and $\lambda_2 > 0$. Their superposition is defined to be the sequence $\{T_n\}_{n \ge 1}$
formed by merging the two sequences $\{T_n^1\}_{n \ge 1}$ and $\{T_n^2\}_{n \ge 1}$ (see Figure 10.1). We
shall prove that

(i) the point processes $\{T_n^1\}_{n \ge 1}$ and $\{T_n^2\}_{n \ge 1}$ have no points in common, and

(ii) the point process $\{T_n\}_{n \ge 1}$ is an HPP with intensity $\lambda = \lambda_1 + \lambda_2$.

Figure 10.1: Superposition, or sum, of two point processes


Indeed, defining $N$ by
$$N(a, b] := N_1(a, b] + N_2(a, b],$$

we see that condition ($\alpha$) of Definition 10.1.2 is satisfied, in view of the independence
of $N_1$ and $N_2$. Also, $N(a, b]$, being the sum of two independent Poisson
random variables of mean $\lambda_1(b - a)$ and $\lambda_2(b - a)$, is a Poisson variable of mean
$\lambda(b - a)$ where $\lambda = \lambda_1 + \lambda_2$, and therefore, condition ($\beta$) of Definition 10.1.2 is
satisfied. This proves (ii). But $N$ is simple, and therefore (i) is true.
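The superposition can be illustrated by merging two simulated event sequences. This is a sketch only: the helper names `hpp_event_times` and `superpose` are illustrative, and each HPP is simulated from exponential inter-event times as in Theorem 10.1.3.

```python
import heapq
import random

def hpp_event_times(lam, horizon, rng):
    """Event times in (0, horizon] of an HPP with intensity lam."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > horizon:
            return times
        times.append(t)

def superpose(times1, times2):
    """Merge two increasing lists of event times into one increasing list."""
    return list(heapq.merge(times1, times2))

# Usage: intensities 1 and 3 on (0, 2]; the merged process has intensity 4,
# so its number of events on (0, 2] averages 8.
rng = random.Random(7)
merged = superpose(hpp_event_times(1.0, 2.0, rng), hpp_event_times(3.0, 2.0, rng))
```

In such a simulation the two sequences share no event time (up to floating-point coincidences of negligible probability), in line with assertion (i).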


The above result can be extended to several (possibly infinitely many) homogeneous
Poisson processes as follows:

Theorem 10.1.4 Let $\{N_i\}_{i \ge 1}$ be a family of independent HPPs with respective
positive intensities $\{\lambda_i\}_{i \ge 1}$. Then,

(i) two distinct HPPs of this family have no points in common, and

(ii) if $\lambda := \sum_{i=1}^{\infty} \lambda_i < \infty$, then $N(t) := \sum_{i=1}^{\infty} N_i(t)$ ($t \ge 0$) defines the counting
process of an HPP with intensity $\lambda$.


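Proof. Assertion (i) has already been proven. Observe that for all $t \ge 0$, $N(t)$ is
almost surely finite since

$$E[N(t)] = \sum_{i=1}^{\infty} E[N_i(t)] = \left( \sum_{i=1}^{\infty} \lambda_i \right) t < \infty.$$

In particular, $N(a, b]$ is almost surely finite for all $(a, b] \subset \mathbb{R}_+$. The proof of lack
of memory of $N$ is the same as in the case of two superposed Poisson processes.
Finally, $N(a, b]$ is a Poisson random variable of mean $\lambda(b - a)$ since

$$\begin{aligned}
P(N(a, b] = k) &= \lim_{n \uparrow \infty} P\left( \sum_{i=1}^{n} N_i(a, b] = k \right) \\
&= \lim_{n \uparrow \infty} e^{-\left( \sum_{i=1}^{n} \lambda_i \right)(b - a)} \frac{\left[ \left( \sum_{i=1}^{n} \lambda_i \right)(b - a) \right]^k}{k!} \\
&= e^{-\lambda(b - a)} \frac{[\lambda(b - a)]^k}{k!}.
\end{aligned}$$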

The next result is called the competition theorem because it features HPPs
competing for the production of the first event.

Theorem 10.1.5 In the situation of Theorem 10.1.4, where $\lambda := \sum_{i=1}^{\infty} \lambda_i < \infty$,
denote by $Z$ the first event time of $N = \sum_{i=1}^{\infty} N_i$ and by $J$ the index of the HPP
responsible for it ($Z$ is the first event of $N_J$). Then

$$P(J = i, Z \ge a) = P(J = i) P(Z \ge a) = \frac{\lambda_i}{\lambda} e^{-\lambda a}. \tag{10.3}$$

In particular, $J$ and $Z$ are independent, $P(J = i) = \frac{\lambda_i}{\lambda}$, and $Z$ is exponential with
mean $\lambda^{-1}$.
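Proof.

A. We first prove the result for a finite number of Poisson processes. We have
to show that if $X_1, \ldots, X_K$ are $K$ independent exponential variables with means
$\lambda_1^{-1}, \ldots, \lambda_K^{-1}$, and if $J_K$ is defined by $X_{J_K} = Z_K := \inf(X_1, \ldots, X_K)$, then

$$P(J_K = 1, Z_K \ge a) = \frac{\lambda_1}{\lambda_1 + \cdots + \lambda_K} \exp\{-(\lambda_1 + \cdots + \lambda_K) a\}. \tag{$\star$}$$

Competition among four point processes (in the situation depicted, $Z = T_1^3$, $J = 3$)

First observe that

$$P(Z_K \ge a) = P\left( \cap_{j=1}^{K} \{X_j \ge a\} \right) = \prod_{j=1}^{K} P(X_j \ge a) = \prod_{j=1}^{K} e^{-\lambda_j a} = e^{-(\lambda_1 + \cdots + \lambda_K) a}.$$

Letting $U := \inf(X_2, \ldots, X_K)$, we have

$$\begin{aligned}
P(J_K = 1, Z_K \ge a) &= P(a \le X_1 < U) \\
&= \int_a^{\infty} P(U > x) \lambda_1 e^{-\lambda_1 x} \, dx = \int_a^{\infty} e^{-(\lambda_2 + \cdots + \lambda_K) x} \lambda_1 e^{-\lambda_1 x} \, dx \\
&= \frac{\lambda_1}{\lambda_1 + \cdots + \lambda_K} e^{-(\lambda_1 + \cdots + \lambda_K) a}.
\end{aligned}$$

This gives ($\star$). Letting $a = 0$ yields $P(J_K = 1) = \frac{\lambda_1}{\lambda_1 + \cdots + \lambda_K}$. This, together with
($\star$) and the expression for $P(Z_K \ge a)$, gives (10.3), for $i = 1$, without loss of
generality.

B. Suppose the result true for a finite number of HPPs. Since the event
$\{J_K = 1, Z_K \ge a\}$ decreases to $\{J = 1, Z \ge a\}$ as $K \uparrow \infty$, we have

$$P(J = 1, Z \ge a) = \lim_{K \uparrow \infty} P(J_K = 1, Z_K \ge a),$$

from which (10.3) follows, using the result of part A of the proof.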
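Part A of the proof lends itself to a quick Monte Carlo check. The sketch below uses a hypothetical helper `compete`, which draws the first event times of finitely many independent HPPs as independent exponentials and returns the winner and its time. Indices are 0-based in the code, whereas the text numbers the processes from 1.

```python
import random

def compete(rates, rng):
    """Return (J, Z): index of the process producing the first event, and its time.

    The first event time of the HPP with intensity rates[i] is exponential
    with parameter rates[i]; the processes are independent.
    """
    firsts = [rng.expovariate(lam) for lam in rates]
    j = min(range(len(rates)), key=firsts.__getitem__)
    return j, firsts[j]

# Usage: rates (1, 2, 3), so lambda = 6; P(J = 2) should be near 3/6
# and the mean of Z near 1/6.
rng = random.Random(3)
samples = [compete([1.0, 2.0, 3.0], rng) for _ in range(20000)]
freq_last = sum(1 for j, _ in samples if j == 2) / len(samples)
mean_z = sum(z for _, z in samples) / len(samples)
```

The independence of $J$ and $Z$ shows up in such an experiment as well: the average of $Z$ restricted to the runs where a given process wins stays close to $1/\lambda$.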
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10.2 Generalities on Point Processes 


A few definitions concerning general point processes are in order. 


Let $\Delta$ be an arbitrary "dummy" element not in $\mathbb{R}^m$. Let $\varepsilon_a$ be the Dirac
measure at $a$ if $a \in \mathbb{R}^m$, the null measure if $a = \Delta$.

Definition 10.2.1 Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of random variables with values in
$\mathbb{R}^m \cup \{\Delta\}$. The collection $\{X_n\}_{n \in \mathbb{N}}$ is called a point process on $\mathbb{R}^m$, and the $X_n$'s
are the points of this point process.


This point process may be represented by the (random) measure

$$N = \sum_{n \in \mathbb{N}} \varepsilon_{X_n}. \tag{10.4}$$

In particular,
$$N(C) = \sum_{n \in \mathbb{N}} 1_C(X_n)$$
counts the number of points in $C \in \mathcal{B}(\mathbb{R}^m)$. The $\Delta$ element plays the role of $\infty$
("a point that does not exist"). Note that it may occur that some of the values
in the list $\{X_n\}_{n \in \mathbb{N}}$ are the same, thus allowing for multiple points.


Definition 10.2.2 The point process $N$ is called simple if almost surely
$N(\omega)(\{c\}) \le 1$ for all $c \in \mathbb{R}^m$.

Definition 10.2.3 The point process $N$ is called locally finite if $P(N(C) < \infty) = 1$
for all bounded measurable sets $C \subset \mathbb{R}^m$.

Definition 10.2.4 The locally finite point process $N$ is called a first-order point
process if $E[N(C)] < \infty$ for all bounded measurable sets $C \subset \mathbb{R}^m$.


EXAMPLE 10.2.5: THE BINOMIAL POINT PROCESS. This point process on $\mathbb{R}^m$
has a finite number $T$ of points, where $T$ is a binomial random variable of size
$n$ and parameter $p \in (0, 1)$:

$$P(T = k) = \binom{n}{k} p^k (1 - p)^{n-k} \quad (0 \le k \le n).$$
If $T = k$, the $k$ points are located independently of one another on $\mathbb{R}^m$ according
to the same probability distribution $Q$. It is locally finite. It is simple if and only
if $Q$ is non-atomic. It is a first-order point process because $T$ is integrable.

Definition 10.2.6 Let $N$ be a point process on $\mathbb{R}^m$ and let $P$ be a probability
measure on $(\Omega, \mathcal{F})$. The distribution of $N$ consists of the probability distributions
of the vectors $(N(C_1), \ldots, N(C_K))$ ($K \ge 1$, $C_1, \ldots, C_K \in \mathcal{B}(\mathbb{R}^m)$).


EXAMPLE 10.2.7: THE POISSON PROCESS. Let $\nu$ be a $\sigma$-finite measure on $\mathbb{R}^m$.
The point process $N$ on $\mathbb{R}^m$ is called a Poisson process on $\mathbb{R}^m$ with intensity
measure $\nu$ if

(i) for all finite families of mutually disjoint sets $C_1, \ldots, C_K \in \mathcal{B}(\mathbb{R}^m)$, the
random variables $N(C_1), \ldots, N(C_K)$ are independent, and

(ii) for any set $C \in \mathcal{B}(\mathbb{R}^m)$ such that $\nu(C) < \infty$,

$$P(N(C) = k) = e^{-\nu(C)} \frac{\nu(C)^k}{k!} \quad (k \ge 0).$$

If $\nu$ is of the form $\nu(C) = \int_C \lambda(x) \, dx$ for some non-negative measurable function
$\lambda : \mathbb{R}^m \to \mathbb{R}$, the Poisson process is said to admit the intensity function $\lambda(x)$.
If in addition $\lambda(x) \equiv \lambda$, $N$ is called a homogeneous Poisson process (HPP) on $\mathbb{R}^m$
with intensity or rate $\lambda$.
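A homogeneous Poisson process restricted to a bounded rectangle of $\mathbb{R}^2$ can be simulated as follows. The sketch relies on a standard fact that is not stated in the text above: conditionally on $N(C) = k$, the $k$ points of an HPP in $C$ are IID and uniformly distributed on $C$. The names `poisson_sample` and `hpp_rectangle` are illustrative, and the Poisson variate is obtained by counting the events in $(0, 1]$ of an HPP on the line, in the spirit of Theorem 10.1.3.

```python
import random

def poisson_sample(mean, rng):
    """Poisson variate with the given positive mean: the number of events
    in (0, 1] of an HPP on the line with intensity equal to the mean."""
    count, t = 0, rng.expovariate(mean)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(mean)
    return count

def hpp_rectangle(lam, width, height, rng):
    """Points of an HPP with intensity lam on the rectangle [0, width] x [0, height]."""
    k = poisson_sample(lam * width * height, rng)  # N(C) is Poisson with mean lam * |C|
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(k)]

# Usage: intensity 5 on a 2 x 1 rectangle, so about 10 points on average.
rng = random.Random(11)
points = hpp_rectangle(5.0, 2.0, 1.0, rng)
```

Counting the points that fall in two disjoint halves of the rectangle over many runs gives two counts with means $\lambda |C| / 2$ and an empirical covariance near 0, consistent with conditions (i) and (ii).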


Definition 10.2.8 A point process $N$ on $\mathbb{R}^m$ is called stationary if for all families
of measurable sets $C_1, \ldots, C_K$ of $\mathbb{R}^m$, $K \ge 1$, the distribution of the random vector
$(N(C_1 + a), \ldots, N(C_K + a))$ is independent of $a \in \mathbb{R}^m$.


EXAMPLE 10.2.9: A STATIONARY GRID, TAKE 1. A grid on $\mathbb{R}^2$ is a deterministic
point process on $\mathbb{R}^2$ whose points are

$$(n T_1, m T_2) \quad (n, m \in \mathbb{Z}),$$

where $T_1$ and $T_2$ are positive real numbers. It is not a stationary point process.
However, the shifted version of it,

$$(n T_1 + V_1, m T_2 + V_2) \quad (n, m \in \mathbb{Z}),$$

where $V_1$ and $V_2$ are independent random variables uniformly distributed on $[0, T_1]$
and $[0, T_2]$ respectively, is stationary. This can be proved directly, or by using the
Laplace functional (Example 10.2.21).

Independent Point Processes


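Definition 10.2.10 The family $N_i$ ($i \in I$) of point processes on $\mathbb{R}^m$, where $I$ is
an arbitrary index set, is called independent if for all finite sets of distinct indices
$i_1, \ldots, i_k \in I$, all integers $K_1, \ldots, K_k \ge 1$, and all $C_1^\ell, \ldots, C_{K_\ell}^\ell \in \mathcal{B}(\mathbb{R}^m)$ ($1 \le \ell \le k$),
the random vectors

$$(N_{i_\ell}(C_1^\ell), \ldots, N_{i_\ell}(C_{K_\ell}^\ell)) \quad (1 \le \ell \le k)$$

are independent.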

Marked Point Processes

Let $N$ and $\{X_n\}_{n \in \mathbb{N}}$ be as in Definition 10.2.1. Let $(K, \mathcal{K})$ be some $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$
and let $\{Z_n\}_{n \in \mathbb{N}}$ be a random sequence with values in $K$.

Definition 10.2.11 The sequence $\{\widetilde{X}_n\}_{n \in \mathbb{N}}$, where
$$\widetilde{X}_n := (X_n, Z_n) \quad (n \in \mathbb{N}),$$

defines a point process $\widetilde{N}$ on $\mathbb{R}^{m+d}$ called a marked point process on $\mathbb{R}^m$ with marks
in $K$; $\{Z_n\}_{n \in \mathbb{N}}$ is the mark sequence and $N$ is the basic point process.

The notation $K$ for $\mathbb{R}^d$ is used for rendering the distinction between the marks
and the original point process $N$ more visual.


The random variable

$$\bar{N}(C \times L) = \sum_{n \in \mathbb{N}} 1_C(X_n)\,1_L(Z_n) \qquad (C \in \mathcal{B}(\mathbb{R}^m),\ L \in \mathcal{K}) \tag{10.5}$$

counts the number of points of the original point process $N$ in $C$ with marks in
$L$. Note that since $\Delta \notin C$, the points $X_n \in \{\Delta\}$ do not appear in the sum above
("points at infinity are excluded"). We shall occasionally use the notation $N_Z$
instead of $\bar{N}$.


The following phrases are then considered equivalent:

(i) "the marked point process $\{(X_n, Z_n)\}_{n \in \mathbb{N}}$",

(ii) "the marked point process $\bar{N}$",

(iii) "the marked point process $(N, Z)$",

(iv) "the marked point process $(N_Z)$".

Definition 10.2.12 If in addition $\{Z_n\}_{n \in \mathbb{N}}$ is IID and independent of $N$, with
common probability distribution $Q_Z$, then $\bar{N}$ is called a marked point process with
independent IID marks.


Point Process Integrals 


Since a point process is a sum of Dirac measures, point process integrals are in 
fact sums. 


Let $\mu$ be a measure on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$ and let $\varphi : (\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be
a measurable function for which the integral $\int_{\mathbb{R}^m} \varphi\,d\mu$ is well defined. This integral
is also denoted by $\mu(\varphi)$.

When $N$ is a point process, the following notations represent the same mathematical
object (if it is well defined):

$$\sum_{n \in \mathbb{N}} \varphi(X_n), \qquad \int_{\mathbb{R}^m} \varphi(x)\,N(dx), \qquad N(\varphi).$$

In the first notation, we use the convention that the sum extends only to those
indices $n$ such that $X_n \in \mathbb{R}^m$, excluding the points "at infinity" (in fact, $\varphi(\Delta)$ is
not defined). In the situation of Definition 10.2.11, observe that

$$\int_{\mathbb{R}^m} \int_K \varphi(x, z)\,\bar{N}(dx \times dz) = \sum_{n \in \mathbb{N}} \varphi(X_n, Z_n),$$

with the same convention as the one just agreed upon concerning points at infinity.


The Intensity Measure 


Definition 10.2.13 Let $N$ be a locally finite point process on $\mathbb{R}^m$. The set function

$$\nu(C) := E[N(C)] \qquad (C \in \mathcal{B}(\mathbb{R}^m))$$

defines a measure on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$, called the mean measure or the intensity measure
of $N$.

The intensity measure of a marked point process $\bar{N}$ (Definition 10.2.11) is the
measure $\bar{\nu}$ on $(\mathbb{R}^m \times K, \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K})$ defined by

$$\bar{\nu}(\bar{C}) := E[\bar{N}(\bar{C})] \qquad (\bar{C} \in \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K}).$$

EXAMPLE 10.2.14: THE INTENSITY MEASURE OF A MARKED POINT PROCESS
WITH INDEPENDENT IID MARKS. Let $\bar{N}$ be as in Definition 10.2.11. Denoting
by $Q_Z$ the common distribution of the marks and by $\nu$ the intensity measure
of the basic point process $N$, the intensity measure of $\bar{N}$ is the product measure
$\bar{\nu}(dx \times dz) = \nu(dx)\,Q_Z(dz)$. The easy proof is left as an exercise.


Campbell’s Formula 


Theorem 10.2.15 Let $N$ be a point process on $\mathbb{R}^m$ with intensity measure $\nu$.
Then, for all measurable functions $\varphi : \mathbb{R}^m \to \mathbb{R}$ which are either non-negative
or $\nu$-integrable, the integral $N(\varphi)$ is well defined (possibly infinite when $\varphi$ is only
assumed to be non-negative) and

$$E[N(\varphi)] = \nu(\varphi). \tag{10.6}$$

In particular, $N(\varphi)$ is a.s. finite if $\varphi$ is $\nu$-integrable.

Proof. First, suppose that $\varphi$ is a simple non-negative measurable function, that
is, of the form

$$\varphi = \sum_{h=1}^{L} a_h 1_{C_h},$$

where $L \in \mathbb{N}_+$, $a_h \in \mathbb{R}_+$ and $C_1, \ldots, C_L$ are disjoint measurable subsets of $\mathbb{R}^m$.
Then

$$E[N(\varphi)] = E\left[\sum_{h=1}^{L} a_h N(C_h)\right] = \sum_{h=1}^{L} a_h \nu(C_h) = \nu(\varphi).$$

Now let $\varphi$ be a non-negative measurable function and let $\{\varphi_n\}_{n \in \mathbb{N}}$ be a non-decreasing
sequence of simple non-negative measurable functions with limit $\varphi$.
Letting $n \uparrow \infty$ in

$$E[N(\varphi_n)] = \nu(\varphi_n)$$

yields the announced result, by monotone convergence. In the case where $\varphi \in L^1_{\mathbb{R}}(\nu)$,
since $E[N(\varphi^{\pm})] = \nu(\varphi^{\pm}) < \infty$, the random variables $N(\varphi^{\pm})$ are $P$-a.s.
finite, and therefore $N(\varphi) = N(\varphi^+) - N(\varphi^-)$ is well defined and finite, and

$$E[N(\varphi)] = E[N(\varphi^+)] - E[N(\varphi^-)] = \nu(\varphi^+) - \nu(\varphi^-) = \nu(\varphi).$$
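Campbell's formula lends itself to a quick numerical sanity check. The sketch below is illustrative only: it takes $N$ to be an HPP of intensity `lam` restricted to $[0, 1]$ and $\varphi(x) = x^2$, so that $\nu(\varphi) = \lambda/3$. The helper names are hypothetical, and the sampler assumes the representation of a finite Poisson process as a Poisson number of IID points (this representation is only established later, in Theorem 10.3.2).

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def sample_hpp_on_unit_interval(lam, rng):
    """Points of an HPP with intensity lam restricted to [0, 1] (assumed construction)."""
    return [rng.random() for _ in range(sample_poisson(lam, rng))]


def monte_carlo_mean_of_sum(lam, phi, runs, rng):
    """Estimate E[N(phi)] = E[sum_n phi(X_n)] by averaging over independent runs."""
    total = 0.0
    for _ in range(runs):
        total += sum(phi(x) for x in sample_hpp_on_unit_interval(lam, rng))
    return total / runs


if __name__ == "__main__":
    rng = random.Random(0)
    estimate = monte_carlo_mean_of_sum(6.0, lambda x: x * x, 5000, rng)
    exact = 6.0 / 3.0  # nu(phi) = lam * (integral of x^2 over [0, 1])
    print(estimate, exact)
```

The two printed numbers should be close; the agreement illustrates (10.6) but of course proves nothing.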


EXAMPLE 10.2.16: CAMPBELL'S FORMULA FOR MARKED POINT PROCESSES
WITH INDEPENDENT IID MARKS. Let $\bar{N}$ be as in Definition 10.2.11. Campbell's
theorem then reads as follows. If the measurable function $\varphi : \mathbb{R}^m \times K \to \mathbb{R}$ is
either non-negative or in $L^1_{\mathbb{R}}(\nu \times Q_Z)$, then the sum

$$\sum_{n \in \mathbb{N}} \varphi(X_n, Z_n)$$

is $P$-a.s. well defined (possibly infinite if $\varphi$ is only assumed non-negative) and

$$E\left[\sum_{n \in \mathbb{N}} \varphi(X_n, Z_n)\right] = \int_{\mathbb{R}^m} E[\varphi(x, Z)]\,\nu(dx),$$

where $Z$ is a $K$-valued random variable with distribution $Q_Z$.


The Laplace Functional 


This functional plays for point processes a role analogous to that of the usual 
Laplace transform for random vectors. 


Definition 10.2.17 Let $N$ be a point process on $\mathbb{R}^m$. The Laplace functional
of $N$ is the mapping $L_N$ associating with a non-negative measurable function
$\varphi : \mathbb{R}^m \to \mathbb{R}$ the non-negative real number

$$L_N(\varphi) := E\left[e^{-N(\varphi)}\right].$$

EXAMPLE 10.2.18: THE LAPLACE FUNCTIONAL OF A POISSON PROCESS.
Anticipating a later result (Theorem 10.3.7), the Laplace functional of a
Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$ is

$$L_N(\varphi) = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) \nu(dx)\right\}.$$


Theorem 10.2.19 The Laplace functional of a locally finite random measure $N$
on $E$ characterizes its distribution.

Proof. It suffices to show that the Laplace functional of a point process $N$ characterizes
its finite-dimensional distributions. For this, just observe that for all $K \geq 1$
and all disjoint measurable sets $C_1, \ldots, C_K$ in $\mathcal{B}(\mathbb{R}^m)$, the Laplace transform of
the vector $(N(C_1), \ldots, N(C_K))$, that is, the function

$$(t_1, \ldots, t_K) \in \mathbb{R}_+^K \mapsto E\left[e^{-(t_1 N(C_1) + \cdots + t_K N(C_K))}\right],$$

is of the form $E\left[e^{-N(\varphi)}\right]$, where $\varphi = t_1 1_{C_1} + \cdots + t_K 1_{C_K}$.

Corollary 10.2.20 A point process $N$ on $\mathbb{R}^m$ is stationary if and only if its
Laplace functional $L_N$ is such that

$$L_N(\varphi) = L_N(S_a \varphi)$$

for all non-negative functions $\varphi$ from $\mathbb{R}^m$ to $\mathbb{R}$ and all $a \in \mathbb{R}^m$, where $(S_a \varphi)(t) := \varphi(t - a)$.


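EXAMPLE 10.2.21: A STATIONARY GRID, TAKE 2. In order to prove the
stationarity of the shifted grid of Example 10.2.9, it suffices to show that for any
non-negative function $\varphi$ from $\mathbb{R}^2$ to $\mathbb{R}$, the quantity

$$E\left[e^{-\sum_{n,m \in \mathbb{Z}} \varphi(nT_1 + V_1 + \alpha,\ mT_2 + V_2 + \beta)}\right]$$

is independent of $\alpha, \beta \in \mathbb{R}$. This quantity equals

$$\frac{1}{T_1 T_2} \int_0^{T_1} \left\{\int_0^{T_2} e^{-\sum_{n,m \in \mathbb{Z}} \varphi(nT_1 + v_1 + \alpha,\ mT_2 + v_2 + \beta)}\,dv_2\right\} dv_1.$$

The integrand is periodic in $v_1$ (with period $T_1$) and in $v_2$ (with period $T_2$), since a
shift by one period only relabels the indices $n$ and $m$. The conclusion follows from the
fact that for any non-negative $T$-periodic function $\psi : \mathbb{R} \to \mathbb{R}$,

$$\int_0^T \psi(u + a)\,du = \int_0^T \psi(u)\,du$$

for all $a \in \mathbb{R}$, by the shift-invariance of the Lebesgue measure.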


Theorem 10.2.22 The family $N_i$ ($i \in I$) of point processes on $\mathbb{R}^m$, where $I$ is
an arbitrary index set, is an independent family if and only if for any finite subset
$J \subset I$, and any collection $\varphi_i$ ($i \in J$) of non-negative measurable functions from
$\mathbb{R}^m$ to $\mathbb{R}$,

$$E\left[e^{-\sum_{i \in J} N_i(\varphi_i)}\right] = \prod_{i \in J} E\left[e^{-N_i(\varphi_i)}\right]. \tag{10.7}$$

Proof. The sufficiency follows immediately from the definition of independence
for point processes. The necessity is left as an exercise.


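EXAMPLE 10.2.23: THE LAPLACE FUNCTIONAL OF THINNED POINT PROCESSES.
Let $N$ be a simple point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n \in \mathbb{N}}$.
Let $\{Z_n\}_{n \in \mathbb{N}}$ be an IID sequence of independent marks of $N$, each $Z_n$ taking its
values in $\{0, 1\}$, with probability $p \in (0, 1)$ for the value 1. The point process
$N_{\mathrm{thin},p}$ defined by

$$N_{\mathrm{thin},p}(C) = \sum_{n \in \mathbb{N}} 1_C(X_n)\,Z_n$$

is called the $p$-thinning of $N$. Each point of $N$ is retained in $N_{\mathrm{thin},p}$ with probability
$p$, independently of everything else. We compute the Laplace functional of the
thinned point process:

$$L_{N_{\mathrm{thin},p}}(\varphi) = E\left[\exp\left\{-\sum_{n \geq 1} \varphi(X_n) Z_n\right\}\right] = \lim_{k \uparrow \infty} E\left[\exp\left\{-\sum_{n=1}^{k} \varphi(X_n) Z_n\right\}\right],$$

where, for fixed $k$,

$$E\left[\exp\left\{-\sum_{n=1}^{k} \varphi(X_n) Z_n\right\}\right] = E\left[\prod_{n=1}^{k} \exp(-\varphi(X_n) Z_n)\right]$$
$$= E\left[E\left[\prod_{n=1}^{k} \exp(-\varphi(X_n) Z_n) \,\Big|\, X_1, \ldots, X_k\right]\right]$$
$$= E\left[\prod_{n=1}^{k} E\left[\exp(-\varphi(X_n) Z_n) \mid X_1, \ldots, X_k\right]\right]$$
$$= E\left[\prod_{n=1}^{k} \left(p \exp(-\varphi(X_n)) + (1 - p)\right)\right]$$
$$= E\left[\exp\left(\sum_{n=1}^{k} \log\left(p \exp(-\varphi(X_n)) + (1 - p)\right)\right)\right].$$

Therefore finally, after letting $k \uparrow \infty$,

$$L_{N_{\mathrm{thin},p}}(\varphi) = L_N\left(-\log\left(p e^{-\varphi} + 1 - p\right)\right).$$

For future reference, we record the intermediary result obtained in the line
before the last one in the above calculation:

$$L_{N_{\mathrm{thin},p}}(\varphi) = E\left[\exp\left\{\int_{\mathbb{R}^m} \log\left(1 - p\left(1 - e^{-\varphi(x)}\right)\right) N(dx)\right\}\right]. \tag{10.8}$$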


10.3 Spatial Poisson Processes


Recall the definition given in Example 10.2.7.

Definition 10.3.1 Let $\nu$ be a $\sigma$-finite measure on $\mathbb{R}^m$. The point process $N$ on
$\mathbb{R}^m$ is called a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$ if

(i) for all finite families of mutually disjoint sets $C_1, \ldots, C_K \in \mathcal{B}(\mathbb{R}^m)$, the
random variables $N(C_1), \ldots, N(C_K)$ are independent, and

(ii) for any set $C \in \mathcal{B}(\mathbb{R}^m)$ such that $\nu(C) < \infty$,

$$P(N(C) = k) = e^{-\nu(C)}\,\frac{\nu(C)^k}{k!} \qquad (k \geq 0).$$

If $\nu$ is of the form $\nu(C) = \int_C \lambda(x)\,dx$ for some non-negative measurable function
$\lambda : \mathbb{R}^m \to \mathbb{R}$, the Poisson process $N$ is said to admit the intensity function $\lambda(x)$.
If in addition $\lambda(x) \equiv \lambda$, $N$ is called a homogeneous Poisson process (HPP) on $\mathbb{R}^m$
with intensity or rate $\lambda$.


We now construct the Poisson process. The basic result is the following: 


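Theorem 10.3.2 Let $T$ be a Poisson random variable of mean $\theta$. Let $\{Z_n\}_{n \geq 1}$ be
an IID sequence of random elements with values in $\mathbb{R}^m$ and common distribution
$Q$. Assume that $T$ is independent of $\{Z_n\}_{n \geq 1}$. The point process $N$ on $\mathbb{R}^m$ defined
by

$$N(C) = \sum_{n=1}^{T} 1_C(Z_n) \qquad (C \in \mathcal{B}(\mathbb{R}^m))$$

is a Poisson process with intensity measure $\nu(\cdot) := \theta \times Q(\cdot)$.

Proof. It suffices to show that for any finite family $C_1, \ldots, C_K$ of pairwise disjoint
measurable sets of $\mathbb{R}^m$ with finite $\nu$-measure and all non-negative reals $t_1, \ldots, t_K$,

$$E\left[e^{-\sum_{j=1}^{K} t_j N(C_j)}\right] = \prod_{j=1}^{K} \exp\left\{\nu(C_j)\left(e^{-t_j} - 1\right)\right\}.$$

We have

$$\sum_{j=1}^{K} t_j N(C_j) = \sum_{j=1}^{K} t_j \left(\sum_{n=1}^{T} 1_{C_j}(Z_n)\right) = \sum_{n=1}^{T} \left(\sum_{j=1}^{K} t_j 1_{C_j}(Z_n)\right) = \sum_{n=1}^{T} Y_n,$$

where $Y_n = \sum_{j=1}^{K} t_j 1_{C_j}(Z_n)$. By Theorem 3.2.22,

$$E\left[e^{-\sum_{n=1}^{T} Y_n}\right] = g_T\left(E\left[e^{-Y_1}\right]\right),$$

where $g_T$ is the generating function of $T$. Here, since $T$ is Poisson with mean $\theta$,

$$g_T(z) = \exp\{\theta(z - 1)\}.$$

The random variable $Y_1$ takes the values $t_1, \ldots, t_K$ and 0 with the respective probabilities
$Q(C_1), \ldots, Q(C_K)$ and $1 - \sum_{j=1}^{K} Q(C_j)$. Therefore

$$E\left[e^{-Y_1}\right] = \sum_{j=1}^{K} e^{-t_j} Q(C_j) + 1 - \sum_{j=1}^{K} Q(C_j) = 1 + \sum_{j=1}^{K} \left(e^{-t_j} - 1\right) Q(C_j),$$

from which we obtain the announced result.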
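The construction in Theorem 10.3.2 translates directly into a sampling procedure. The following sketch is illustrative only: the function names and the choice of $Q$ as the uniform distribution on the unit square are assumptions made for the example, not part of the text.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def sample_finite_poisson_process(theta, sample_q, rng):
    """Return the points Z_1, ..., Z_T, with T ~ Poisson(theta) and Z_n IID with law Q."""
    number_of_points = sample_poisson(theta, rng)
    return [sample_q(rng) for _ in range(number_of_points)]


def count_in(points, indicator):
    """N(C): the number of points z for which indicator(z) is true, that is, z in C."""
    return sum(1 for z in points if indicator(z))


def uniform_on_unit_square(rng):
    """Hypothetical choice of Q: the uniform distribution on [0, 1]^2."""
    return (rng.random(), rng.random())


if __name__ == "__main__":
    rng = random.Random(1)
    points = sample_finite_poisson_process(8.0, uniform_on_unit_square, rng)
    # Number of points in the left half of the square.
    print(len(points), count_in(points, lambda z: z[0] < 0.5))
```

With $\theta = 8$ and $C$ the left half of the square, $\nu(C) = \theta\,Q(C) = 4$, so over many independent runs the empirical mean and variance of `count_in` should both be close to 4, as expected for a Poisson variable.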


The above is a special case of what remains to be done, namely to construct a Poisson
process on $\mathbb{R}^m$ with an intensity measure $\nu$ that is $\sigma$-finite (not just finite). Such
a measure can be decomposed as

$$\nu(\cdot) = \sum_{j \geq 1} \theta_j \times Q_j(\cdot),$$

where the $\theta_j$'s are positive real numbers and the $Q_j$'s are probability distributions
on $\mathbb{R}^m$. One can construct independent Poisson processes $N_j$ on $\mathbb{R}^m$ with respective
intensity measures $\theta_j Q_j(\cdot)$. The result then follows from the following:

Theorem 10.3.3 Let $\nu$ be a $\sigma$-finite measure on $\mathbb{R}^m$ of the form $\nu = \sum_{i \geq 1} \nu_i$,
where the $\nu_i$'s ($i \geq 1$) are $\sigma$-finite measures on $\mathbb{R}^m$. Let $N_i$ ($i \geq 1$) be a family of
independent Poisson processes on $\mathbb{R}^m$ with respective intensity measures $\nu_i$ ($i \geq 1$).
Then the point process

$$N = \sum_{i \geq 1} N_i$$

is a Poisson process with intensity measure $\nu$.


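Proof. For mutually disjoint measurable sets $C_1, \ldots, C_K$ of finite $\nu$-measure,
and non-negative reals $t_1, \ldots, t_K$,

$$E\left[e^{-\sum_{\ell=1}^{K} t_\ell N(C_\ell)}\right] = E\left[e^{-\sum_{\ell=1}^{K} t_\ell \left(\sum_{j \geq 1} N_j(C_\ell)\right)}\right] = \lim_{n \uparrow \infty} E\left[e^{-\sum_{\ell=1}^{K} t_\ell \left(\sum_{j=1}^{n} N_j(C_\ell)\right)}\right]$$

by dominated convergence. But

$$E\left[e^{-\sum_{\ell=1}^{K} t_\ell \left(\sum_{j=1}^{n} N_j(C_\ell)\right)}\right] = E\left[e^{-\sum_{j=1}^{n} \left(\sum_{\ell=1}^{K} t_\ell N_j(C_\ell)\right)}\right]$$
$$= \prod_{j=1}^{n} E\left[e^{-\sum_{\ell=1}^{K} t_\ell N_j(C_\ell)}\right] = \prod_{j=1}^{n} \prod_{\ell=1}^{K} E\left[e^{-t_\ell N_j(C_\ell)}\right]$$
$$= \prod_{j=1}^{n} \prod_{\ell=1}^{K} \exp\left\{\left(e^{-t_\ell} - 1\right) \nu_j(C_\ell)\right\}$$
$$= \prod_{j=1}^{n} \exp\left\{\sum_{\ell=1}^{K} \left(e^{-t_\ell} - 1\right) \nu_j(C_\ell)\right\}$$
$$= \exp\left\{\sum_{\ell=1}^{K} \left(e^{-t_\ell} - 1\right) \left(\sum_{j=1}^{n} \nu_j(C_\ell)\right)\right\}.$$

Letting $n \uparrow \infty$ we obtain, by dominated convergence,

$$E\left[e^{-\sum_{\ell=1}^{K} t_\ell N(C_\ell)}\right] = \exp\left\{\sum_{\ell=1}^{K} \left(e^{-t_\ell} - 1\right) \nu(C_\ell)\right\}.$$

Therefore $N(C_1), \ldots, N(C_K)$ are independent Poisson random variables with
respective means $\nu(C_1), \ldots, \nu(C_K)$.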


Theorem 10.3.4 Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$.

(a) If $\nu$ is locally finite, then $N$ is locally finite.

(b) If $\nu$ is locally finite and non-atomic, then $N$ is simple.

Proof. (a) If $C$ is a bounded measurable set, it is of finite $\nu$-measure, and therefore
$E[N(C)] = \nu(C) < \infty$, which implies that $N(C) < \infty$, $P$-almost surely.

(b) It suffices to show this for a finite intensity measure $\nu(\cdot) = \theta\,Q(\cdot)$, where $\theta$
is a positive real number and $Q$ is a non-atomic probability measure on $\mathbb{R}^m$, and
then use the construction of Theorem 10.3.2. In turn, it suffices to show that for
each $n \geq 1$, $P(Z_i = Z_j \text{ for some pair } (i, j)\ (1 \leq i < j \leq n) \mid N(\mathbb{R}^m) = n) = 0$.
This is the case because for IID vectors $Z_1, \ldots, Z_n$ with a non-atomic probability
distribution, $P(Z_i = Z_j \text{ for some pair } (i, j)\ (1 \leq i < j \leq n)) = 0$.


EXAMPLE 10.3.5: THINNED POISSON PROCESS. If the initial point process of Example 10.2.23 is
a Poisson process with the locally finite intensity measure $\nu$,

$$L_{N_{\mathrm{thin},p}}(\varphi) = L_N(\psi) = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\psi(x)} - 1\right) \nu(dx)\right\},$$

where $\psi(x) := -\log\left(p e^{-\varphi(x)} + 1 - p\right)$. Therefore $e^{-\psi(x)} = p e^{-\varphi(x)} + 1 - p = p\left(e^{-\varphi(x)} - 1\right) + 1$ and finally

$$L_{N_{\mathrm{thin},p}}(\varphi) = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) p\,\nu(dx)\right\}.$$

We therefore retrieve the standard result: $p$-thinning a Poisson process of intensity
measure $\nu(\cdot)$ results in a Poisson process of intensity measure $\nu_p(\cdot) = p\,\nu(\cdot)$.
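The thinning result can be illustrated by simulation. The sketch below is a minimal example under stated assumptions: the base process is an HPP of intensity `lam` on $[0, 1]$, sampled as a Poisson number of uniform points (Theorem 10.3.2), and all function names are hypothetical. If the thinned process is Poisson with intensity $p\lambda$, the count of retained points should have mean and variance both close to $p\lambda$.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def p_thinning(points, p, rng):
    """Keep each point independently with probability p (the event Z_n = 1)."""
    return [x for x in points if rng.random() < p]


def thinned_counts(lam, p, runs, rng):
    """Return `runs` independent samples of the thinned count on [0, 1]."""
    counts = []
    for _ in range(runs):
        points = [rng.random() for _ in range(sample_poisson(lam, rng))]
        counts.append(len(p_thinning(points, p, rng)))
    return counts


def mean_and_variance(values):
    """Empirical mean and (biased) empirical variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    rng = random.Random(2)
    print(mean_and_variance(thinned_counts(10.0, 0.3, 5000, rng)))
```

Equality of mean and variance is only a necessary signature of the Poisson distribution; the full statement is the Laplace functional identity above.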


Doubly Stochastic Poisson Processes 
Doubly stochastic Poisson processes are also called Cox processes. 


Let $\{\lambda(x)\}_{x \in \mathbb{R}^m}$ be a real-valued non-negative stochastic process such that almost
surely

$$\int_C \lambda(x)\,dx < \infty \quad \text{for all bounded } C \in \mathcal{B}(\mathbb{R}^m).$$

A point process is constructed as follows: first generate the stochastic intensity
process $\{\lambda(x)\}_{x \in \mathbb{R}^m}$ and, having done so, generate a Poisson process $N$ with this
intensity. The resulting point process is called a doubly stochastic Poisson process
(or Cox process) with the (stochastic) intensity function $\{\lambda(x)\}_{x \in \mathbb{R}^m}$.

In the case where

$$\lambda(x) = \Lambda \qquad (x \in \mathbb{R}^m),$$

where $\Lambda$ is a non-negative finite random variable, the corresponding Cox process
is also called a mixed Poisson process.
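The two-stage description of a mixed Poisson process can be sketched in code. This is a minimal illustration, not a general Cox process sampler: it only produces the count $N(C)$ of a set $C$ of Lebesgue measure `volume`, and the choice of $\Lambda$ uniform on $[0, 10]$ is an assumption made for the example. By the law of total variance, $\mathrm{Var}(N(C)) = E[\Lambda]\,|C| + \mathrm{Var}(\Lambda)\,|C|^2$, so the counts are overdispersed relative to a Poisson variable whenever $\Lambda$ is not constant.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def mixed_poisson_count(sample_big_lambda, volume, rng):
    """First draw the random intensity Lambda, then draw N(C) ~ Poisson(Lambda * volume)."""
    big_lambda = sample_big_lambda(rng)
    return sample_poisson(big_lambda * volume, rng)


def mean_and_variance(values):
    """Empirical mean and (biased) empirical variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    rng = random.Random(3)
    uniform_intensity = lambda r: 10.0 * r.random()  # hypothetical law of Lambda
    counts = [mixed_poisson_count(uniform_intensity, 1.0, rng) for _ in range(5000)]
    # Expected: mean near 5, variance near 5 + 100/12.
    print(mean_and_variance(counts))
```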

The Covariance Formula


Let $N$ be a Poisson process on $\mathbb{R}^m$, with intensity measure $\nu$. Recall Campbell's
theorem (Theorem 10.2.15). Let $\varphi : \mathbb{R}^m \to \mathbb{R}$ be a $\nu$-integrable measurable function.
Then $N(\varphi)$ is a well-defined integrable random variable, and

$$E\left[\int_{\mathbb{R}^m} \varphi(x)\,N(dx)\right] = \int_{\mathbb{R}^m} \varphi(x)\,\nu(dx). \tag{10.9}$$


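Theorem 10.3.6 Let $N$ be as above. Let $\varphi, \psi : \mathbb{R}^m \to \mathbb{C}$ be two $\nu$-integrable
measurable functions such that moreover $|\varphi|^2$ and $|\psi|^2$ are $\nu$-integrable. Then $N(\varphi)$
and $N(\psi)$ are well-defined square-integrable random variables and

$$\mathrm{cov}\left(\int_{\mathbb{R}^m} \varphi(x)\,N(dx),\ \int_{\mathbb{R}^m} \psi(x)\,N(dx)\right) = \int_{\mathbb{R}^m} \varphi(x)\,\psi(x)^{*}\,\nu(dx). \tag{10.10}$$

Proof. It is enough to consider the case of real functions. First suppose that $\varphi$
and $\psi$ are simple non-negative Borel functions. We can always assume that

$$\varphi := \sum_{h=1}^{K} a_h 1_{C_h}, \qquad \psi := \sum_{h=1}^{K} b_h 1_{C_h},$$

where $C_1, \ldots, C_K$ are disjoint measurable subsets of $\mathbb{R}^m$. In particular, $\varphi(x)\psi(x) = \sum_{h=1}^{K} a_h b_h 1_{C_h}(x)$.
Using the facts that if $i \neq j$, $N(C_i)$ and $N(C_j)$ are independent,
and that a Poisson random variable with mean $\theta$ has variance $\theta$,

$$E[N(\varphi)N(\psi)] = \sum_{h,l=1}^{K} a_h b_l\,E[N(C_h)N(C_l)]$$
$$= \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\,E[N(C_h)N(C_l)] + \sum_{l=1}^{K} a_l b_l\,E[N(C_l)^2]$$
$$= \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\,E[N(C_h)]\,E[N(C_l)] + \sum_{l=1}^{K} a_l b_l\,E[N(C_l)^2],$$

and therefore

$$E[N(\varphi)N(\psi)] = \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\,\nu(C_h)\nu(C_l) + \sum_{l=1}^{K} a_l b_l \left[\nu(C_l) + \nu(C_l)^2\right]$$
$$= \sum_{h,l=1}^{K} a_h b_l\,\nu(C_h)\nu(C_l) + \sum_{l=1}^{K} a_l b_l\,\nu(C_l)$$
$$= \nu(\varphi)\nu(\psi) + \nu(\varphi\psi).$$

Let now $\varphi, \psi$ be non-negative and let $\{\varphi_n\}_{n \geq 1}$, $\{\psi_n\}_{n \geq 1}$ be non-decreasing sequences
of simple non-negative functions, with respective limits $\varphi$ and $\psi$. Letting
$n$ go to $\infty$ in the equality

$$E[N(\varphi_n)N(\psi_n)] = \nu(\varphi_n \psi_n) + \nu(\varphi_n)\nu(\psi_n)$$

yields the announced result, by monotone convergence.

We have that for any $\nu$-integrable real-valued function $\varphi$,

$$E[N(\varphi)] = E[N(\varphi^+)] - E[N(\varphi^-)] = \nu(\varphi^+) - \nu(\varphi^-) = \nu(\varphi).$$

Also, by the result in the non-negative case, $E\left[N(|\varphi|)^2\right] = \nu(|\varphi|^2) + \nu(|\varphi|)^2$.
Therefore, since $|N(\varphi)| \leq N(|\varphi|)$, $N(\varphi)$ is a square-integrable variable, as well
as $N(\psi)$ for the same reasons. Therefore, by Schwarz's inequality, $N(\varphi)N(\psi)$ is
integrable. We have

$$E[N(\varphi)N(\psi)] = E\left[\left(N(\varphi^+) - N(\varphi^-)\right)\left(N(\psi^+) - N(\psi^-)\right)\right]$$
$$= E[N(\varphi^+)N(\psi^+)] + E[N(\varphi^-)N(\psi^-)] - E[N(\varphi^+)N(\psi^-)] - E[N(\varphi^-)N(\psi^+)]$$
$$= \left(\nu(\varphi^+\psi^+) + \nu(\varphi^+)\nu(\psi^+)\right) + \left(\nu(\varphi^-\psi^-) + \nu(\varphi^-)\nu(\psi^-)\right)$$
$$\quad - \left(\nu(\varphi^+\psi^-) + \nu(\varphi^+)\nu(\psi^-)\right) - \left(\nu(\varphi^-\psi^+) + \nu(\varphi^-)\nu(\psi^+)\right)$$
$$= \nu(\varphi\psi) + \nu(\varphi)\nu(\psi),$$

from which (10.10) follows.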


The Exponential Formula 


We now turn to the exponential formula for Poisson processes. 


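Theorem 10.3.7 Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$.
Let $\varphi : \mathbb{R}^m \to \mathbb{R}$ be a non-negative measurable function. Then,

$$E\left[e^{-\int_{\mathbb{R}^m} \varphi(x)\,N(dx)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) \nu(dx)\right\}$$

and

$$E\left[e^{i\int_{\mathbb{R}^m} \varphi(x)\,N(dx)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{i\varphi(x)} - 1\right) \nu(dx)\right\}.$$

Proof. We prove the first formula, the proof of the second being similar. Suppose
that $\varphi$ is simple and non-negative: $\varphi = \sum_{h=1}^{K} a_h 1_{C_h}$, where $C_1, \ldots, C_K$ are mutually
disjoint measurable subsets of $\mathbb{R}^m$. Then

$$E\left[e^{-N(\varphi)}\right] = E\left[e^{-\sum_{h=1}^{K} a_h N(C_h)}\right] = E\left[\prod_{h=1}^{K} e^{-a_h N(C_h)}\right]$$
$$= \prod_{h=1}^{K} E\left[e^{-a_h N(C_h)}\right] = \prod_{h=1}^{K} \exp\left\{\left(e^{-a_h} - 1\right) \nu(C_h)\right\}$$
$$= \exp\left\{\sum_{h=1}^{K} \left(e^{-a_h} - 1\right) \nu(C_h)\right\} = \exp\left\{\nu\left(e^{-\varphi} - 1\right)\right\}.$$

The formula is therefore true for non-negative simple functions. Take now a non-decreasing
sequence $\{\varphi_n\}_{n \geq 1}$ of such functions converging to $\varphi$. For all $n \geq 1$,

$$E\left[e^{-N(\varphi_n)}\right] = \exp\left\{\nu\left(e^{-\varphi_n} - 1\right)\right\}.$$

By monotone convergence, the limit as $n$ tends to $\infty$ of $N(\varphi_n)$ is $N(\varphi)$. Consequently,
by dominated convergence, the limit of the left-hand side is $E\left[e^{-N(\varphi)}\right]$. The
function $g_n = -\left(e^{-\varphi_n} - 1\right)$ is a non-negative function increasing to $g = -\left(e^{-\varphi} - 1\right)$,
and therefore, by monotone convergence, $\nu\left(e^{-\varphi_n} - 1\right) = -\nu(g_n)$ converges to
$\nu\left(e^{-\varphi} - 1\right) = -\nu(g)$, which in turn implies that the right-hand side of the last
displayed equality tends to $\exp\left\{\nu\left(e^{-\varphi} - 1\right)\right\}$ as $n$ tends to $\infty$.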


The covariance formula can of course be obtained from the exponential formula
by differentiation of $t \mapsto E\left[e^{-tN(\varphi)}\right]$.


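EXAMPLE 10.3.8: THE MAXIMUM FORMULA. Let $N$ be a simple Poisson process
on $\mathbb{R}^m$ with intensity measure $\nu$ and let $\varphi : \mathbb{R}^m \to \mathbb{R}$. Then

$$P\left(\sup_{n \in \mathbb{N}} \varphi(X_n) \leq a\right) = \exp\left\{-\int_{\mathbb{R}^m} 1_{\{\varphi(x) > a\}}\,\nu(dx)\right\}.$$

A direct proof based on the construction of Poisson processes in Section 10.3 is
possible (Exercise 10.5.21). We take another path and first prove that

$$\lim_{\theta \uparrow \infty} E\left[e^{-\theta \sum_{n \in \mathbb{N}} 1_{\{\varphi(X_n) > a\}}}\right] = P\left(\sup_{n \in \mathbb{N}} \varphi(X_n) \leq a\right). \tag{$\star$}$$

Indeed, the sum $\sum_{n \in \mathbb{N}} 1_{\{\varphi(X_n) > a\}}$ is strictly positive, except when $\sup_{n \in \mathbb{N}} \varphi(X_n) \leq a$,
in which case it is null. Therefore

$$\lim_{\theta \uparrow \infty} e^{-\theta \sum_{n \in \mathbb{N}} 1_{\{\varphi(X_n) > a\}}} = 1_{\{\sup_{n \in \mathbb{N}} \varphi(X_n) \leq a\}}.$$

Taking expectations yields ($\star$), by dominated convergence. Now, by Theorem
10.3.7,

$$E\left[e^{-\theta \sum_{n \in \mathbb{N}} 1_{\{\varphi(X_n) > a\}}}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\theta 1_{\{\varphi(x) > a\}}} - 1\right) \nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\theta} - 1\right) 1_{\{\varphi(x) > a\}}\,\nu(dx)\right\},$$

and the limit of the latter quantity as $\theta \uparrow \infty$ is $\exp\left\{-\int_{\mathbb{R}^m} 1_{\{\varphi(x) > a\}}\,\nu(dx)\right\}$.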
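The maximum formula is easy to check numerically in a simple case. The sketch below is illustrative and rests on assumptions chosen for the example: $N$ is an HPP of intensity `lam` on $[0, 1]$ (sampled as a Poisson number of uniform points), $\varphi(x) = x$, and $0 < a < 1$, so that $\nu(\{\varphi > a\}) = \lambda(1 - a)$. A configuration with no point counts as a success, since the supremum over the empty set is $-\infty$.

```python
import math
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def probability_sup_at_most(lam, a, runs, rng):
    """Monte Carlo estimate of P(sup_n X_n <= a) for an HPP of intensity lam on [0, 1]."""
    successes = 0
    for _ in range(runs):
        points = [rng.random() for _ in range(sample_poisson(lam, rng))]
        if all(x <= a for x in points):  # also true when there are no points
            successes += 1
    return successes / runs


def maximum_formula(lam, a):
    """exp{-nu({x : phi(x) > a})} with nu = lam * Lebesgue on [0, 1] and phi(x) = x."""
    return math.exp(-lam * (1.0 - a))


if __name__ == "__main__":
    rng = random.Random(4)
    print(probability_sup_at_most(3.0, 0.5, 10000, rng), maximum_formula(3.0, 0.5))
```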


EXAMPLE 10.3.9: THE LAPLACE FUNCTIONAL OF A POISSON PROCESS. According
to Theorem 10.3.7, the Laplace functional of a Poisson process $N$ on $\mathbb{R}^m$
with intensity measure $\nu$ is

$$L_N(\varphi) = \exp\left\{\nu\left(e^{-\varphi} - 1\right)\right\}.$$

Theorem 10.3.10 Let $N_i$ ($i \in J$) be a finite collection of simple point processes
on $\mathbb{R}^m$. If for any collection $\varphi_i : \mathbb{R}^m \to \mathbb{R}$ ($i \in J$) of non-negative measurable
functions,

$$E\left[e^{-\sum_{i \in J} N_i(\varphi_i)}\right] = \prod_{i \in J} \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_i(x)} - 1\right) \nu_i(dx)\right\}, \tag{10.11}$$

where $\nu_i$, $i \in J$, is a collection of $\sigma$-finite measures on $\mathbb{R}^m$, then $N_i$, $i \in J$, is
a family of independent Poisson processes with respective intensity measures $\nu_i$,
$i \in J$.

Proof. Taking all the $\varphi_i$'s identically null except the first one, we have

$$E\left[e^{-N_1(\varphi_1)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_1(x)} - 1\right) \nu_1(dx)\right\},$$

and therefore $N_1$ is a Poisson process with intensity measure $\nu_1$. Similarly, for any
$i \in J$, $N_i$ is a Poisson process with intensity measure $\nu_i$. Independence follows
from Theorem 10.2.22.


Marked Spatial Poisson Processes 


Let

($\alpha$) $N$ be a simple and locally finite point process on $\mathbb{R}^m$, with point sequence $\{X_n\}_{n \in \mathbb{N}}$,
and

($\beta$) $\{Z_n\}_{n \in \mathbb{N}}$ be a sequence of random elements taking their values in the measurable
space $(K, \mathcal{K}) := (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ for some integer $d \geq 1$.

The sequence $\{X_n, Z_n\}_{n \in \mathbb{N}}$ is a marked point process, with the interpretation that
$Z_n$ is the mark associated with the point $X_n$. $N$ is the base point process of the
marked point process, and $\{Z_n\}_{n \in \mathbb{N}}$ is the associated sequence of marks. One also
calls $\bar{N}$ a simple and locally finite point process on $\mathbb{R}^m$ with marks $\{Z_n\}_{n \in \mathbb{N}}$ in $K$.
If moreover

(1) $N$ is a Poisson process with intensity measure $\nu$,

(2) $\{Z_n\}_{n \in \mathbb{N}}$ is an IID sequence, and

(3) $\{Z_n\}_{n \in \mathbb{N}}$ and $N$ are independent,

the corresponding marked point process is called a Poisson process on $\mathbb{R}^m$ with
independent IID marks. This model can be slightly generalized by allowing the
mark distribution to depend on the location of the marked point. More precisely,
we replace (2) and (3) by

(2') $\{Z_n\}_{n \in \mathbb{N}}$ is, conditionally on $N$, an independent sequence,

(3') given $X_n$, the random vector $Z_n$ is independent of $X_k$ ($k \in \mathbb{N}$, $k \neq n$), and

(4') for all $n \in \mathbb{N}$ and all $L \in \mathcal{K}$,

$$P(Z_n \in L \mid X_n) = Q(X_n, L),$$

where $Q(\cdot, \cdot)$ is a stochastic kernel from $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$ to $(K, \mathcal{K})$, that is, $Q$ is
a function from $\mathbb{R}^m \times \mathcal{K}$ to $[0, 1]$ such that for all $L \in \mathcal{K}$ the map $x \mapsto Q(x, L)$
is measurable, and for all $x \in \mathbb{R}^m$, $Q(x, \cdot)$ is a probability measure on $(K, \mathcal{K})$.


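Theorem 10.3.11 Let $\{X_n, Z_n\}_{n \in \mathbb{N}}$ be as in ($\alpha$) and ($\beta$) above, and define the
point process $\bar{N}$ on $\mathbb{R}^m \times K$ by

$$\bar{N}(A) = \sum_{n \in \mathbb{N}} 1_A(X_n, Z_n) \qquad (A \in \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K}). \tag{10.12}$$

If conditions (1), (2'), (3'), and (4') above are satisfied, then $\bar{N}$ is a simple Poisson
process with intensity measure $\bar{\nu}$ given by

$$\bar{\nu}(C \times L) = \int_C Q(x, L)\,\nu(dx) \qquad (C \in \mathcal{B}(\mathbb{R}^m),\ L \in \mathcal{K}).$$

Proof. In view of Theorem 10.2.19, it suffices to show that the Laplace functional
of $\bar{N}$ has the appropriate form, that is, for any non-negative measurable function
$\bar{\varphi} : \mathbb{R}^m \times K \to \mathbb{R}$,

$$E\left[e^{-\bar{N}(\bar{\varphi})}\right] = \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\bar{\varphi}(t, z)} - 1\right) \bar{\nu}(dt \times dz)\right\}.$$

By dominated convergence,

$$E\left[e^{-\bar{N}(\bar{\varphi})}\right] = E\left[e^{-\sum_{n \in \mathbb{N}} \bar{\varphi}(X_n, Z_n)}\right] = \lim_{L \uparrow \infty} E\left[e^{-\sum_{n \leq L} \bar{\varphi}(X_n, Z_n)}\right].$$

For the time being, fix a positive integer $L$. Then, taking into account assumptions
(2') and (3'),

$$E\left[e^{-\sum_{n \leq L} \bar{\varphi}(X_n, Z_n)}\right] = E\left[\prod_{n \leq L} e^{-\bar{\varphi}(X_n, Z_n)}\right] = E\left[E\left[\prod_{n \leq L} e^{-\bar{\varphi}(X_n, Z_n)} \,\Big|\, N\right]\right]$$
$$= E\left[\prod_{n \leq L} \int_K e^{-\bar{\varphi}(X_n, z)}\,Q(X_n, dz)\right] = E\left[e^{-\sum_{n \leq L} \psi(X_n)}\right],$$

where $\psi(x) := -\log \int_K e^{-\bar{\varphi}(x, z)}\,Q(x, dz)$, a non-negative function. Letting $L \uparrow \infty$,
we have, by dominated convergence,

$$E\left[e^{-\bar{N}(\bar{\varphi})}\right] = E\left[e^{-\sum_{n \in \mathbb{N}} \psi(X_n)}\right] = E\left[e^{-N(\psi)}\right]$$
$$= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\psi(x)} - 1\right) \nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \left(\int_K e^{-\bar{\varphi}(x, z)}\,Q(x, dz) - 1\right) \nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\bar{\varphi}(x, z)} - 1\right) Q(x, dz)\,\nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\bar{\varphi}(x, z)} - 1\right) \bar{\nu}(dx \times dz)\right\}.$$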


EXAMPLE 10.3.12: THE M/GI/∞ MODEL, TAKE 1. The model of this example
is of interest in queueing theory and in the traffic analysis of communications
networks. We adopt the queueing interpretation. Let $N$ be an HPP on $\mathbb{R}$ with
intensity $\lambda$, and $\{\sigma_n\}_{n \in \mathbb{Z}}$ be an IID sequence of random variables taking their values in
$\mathbb{R}_+$, with common probability distribution $Q$. Assume moreover that $\{\sigma_n\}_{n \in \mathbb{Z}}$ and $N$ are
independent. The $n$-th event time of $N$, $T_n$, is the arrival time of the $n$-th customer,
and $\sigma_n$ is her service time request. Define the point process $\bar{N}$ on $\mathbb{R} \times \mathbb{R}_+$ by

$$\bar{N}(C) = \sum_{n \in \mathbb{Z}} 1_C(T_n, \sigma_n)$$

for all $C \in \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}_+)$. According to Theorem 10.3.11, $\bar{N}$ is a simple Poisson
process with intensity measure

$$\bar{\nu}(dt \times dz) = \lambda\,dt \times Q(dz).$$

In the M/GI/∞ model,² a customer arriving at time $T_n$ is immediately served,
and therefore departs from the "system" at time $T_n + \sigma_n$. The number $X(t)$ of
customers present in the system at time $t$ is therefore given by the formula

$$X(t) = \sum_{n \in \mathbb{Z}} 1_{(-\infty, t]}(T_n)\,1_{(t, +\infty)}(T_n + \sigma_n).$$

(The $n$-th customer is in the system at time $t$ if and only if she arrived at time
$T_n \leq t$ and departed at time $T_n + \sigma_n > t$.)

Assume that the service times have finite expectation: $E[\sigma_1] < \infty$. Then, for all
$t \in \mathbb{R}$, $X(t)$ is a Poisson random variable with mean $\lambda E[\sigma_1]$.

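Proof. Observe that

$$X(t) = \bar{N}(C(t)),$$

where $C(t) := \{(s, \sigma);\ s \leq t,\ s + \sigma > t\} \subset \mathbb{R} \times \mathbb{R}_+$. In particular, $X(t)$ is a
Poisson random variable with mean

$$\bar{\nu}(C(t)) = \int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{\{s + \sigma > t\}} 1_{\{s \leq t\}}\,\bar{\nu}(ds \times d\sigma)$$
$$= \int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{\{s + \sigma > t\}} 1_{\{s \leq t\}}\,\lambda\,ds \times Q(d\sigma)$$
$$= \int_{\mathbb{R}} \left(\int_{\mathbb{R}_+} 1_{\{\sigma > t - s\}}\,Q(d\sigma)\right) 1_{\{s \leq t\}}\,\lambda\,ds$$
$$= \lambda \int_{-\infty}^{t} Q((t - s, +\infty))\,ds$$
$$= \lambda \int_0^{\infty} Q((s, +\infty))\,ds = \lambda \int_0^{\infty} P(\sigma_1 > s)\,ds = \lambda E[\sigma_1].$$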
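The result $X(t) \sim \mathrm{Poisson}(\lambda E[\sigma_1])$ can be illustrated by simulating the marked process directly. The sketch below makes assumptions that are not in the text: service times are uniform on $[0, s_{\max}]$ (so $E[\sigma_1] = s_{\max}/2$), which means that only arrivals in $[-s_{\max}, 0]$ can still be present at time $t = 0$ and the arrival process can be truncated to that window. The function names are hypothetical.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def customers_in_system_at_zero(lam, max_service, rng):
    """Sample X(0) when service times are uniform on [0, max_service].

    Arrivals form an HPP of intensity lam; only those in [-max_service, 0] matter.
    """
    arrivals = sample_poisson(lam * max_service, rng)
    present = 0
    for _ in range(arrivals):
        arrival_time = -max_service * rng.random()
        service_time = max_service * rng.random()
        if arrival_time + service_time > 0.0:  # departure after time 0
            present += 1
    return present


if __name__ == "__main__":
    rng = random.Random(5)
    samples = [customers_in_system_at_zero(4.0, 2.0, rng) for _ in range(5000)]
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # Both should be near lam * E[sigma] = 4 * 1 = 4.
    print(mean, variance)
```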


It can be shown that the departure process $D$ of departure times, defined by

$$D(C) = \sum_{n \in \mathbb{Z}} 1_C(T_n + \sigma_n),$$

is an HPP of intensity $\lambda$ (Exercise 10.5.16).

² "∞" represents the number of servers. This model is sometimes called a "queueing" system,
although in reality there is no queueing, since customers are served immediately upon arrival
and without interruption. It is in fact a "pure delay" system.

Formulas such as Campbell's first formula and the Poisson exponential formula
are straightforwardly extended to marked point processes.


In the situation prevailing in Theorem 10.3.11, consider sums of the type

$$\bar{N}(\bar{\varphi}) := \sum_{n \in \mathbb{N}} \bar{\varphi}(X_n, Z_n), \tag{10.13}$$

for functions $\bar{\varphi} : \mathbb{R}^m \times K \to \mathbb{R}$. Note that, denoting by $\bar{Z}(x)$ any random element
of $K$ with the distribution $Q(x, dz)$,

$$\bar{\nu}(\bar{\varphi}) = \int_{\mathbb{R}^m} \int_K \bar{\varphi}(x, z)\,Q(x, dz)\,\nu(dx) = \int_{\mathbb{R}^m} E\left[\bar{\varphi}(x, \bar{Z}(x))\right] \nu(dx),$$

whenever the quantities involved have a meaning. Using this observation, the formulas
obtained in the previous subsection can be applied in terms of marked point
processes. The corollaries below do not require proofs, since they are reformulations
of previous results, namely Theorem 10.3.6 and Theorem 10.3.7.


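Let $0 < p < \infty$. Recall that a measurable function $\bar{\varphi} : \mathbb{R}^m \times K \to \mathbb{R}$ (resp.
$\to \mathbb{C}$) is said to be in $L^p_{\mathbb{R}}(\bar{\nu})$ (resp. $L^p_{\mathbb{C}}(\bar{\nu})$) if

$$\int_{\mathbb{R}^m} \int_K |\bar{\varphi}(x, z)|^p\,Q(x, dz)\,\nu(dx) < \infty.$$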


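Corollary 10.3.13 Suppose that $\bar{\varphi} \in L^1_{\mathbb{C}}(\bar{\nu})$. Then the sum (10.13) is well defined,
and moreover

$$E\left[\sum_{n \in \mathbb{N}} \bar{\varphi}(X_n, Z_n)\right] = \int_{\mathbb{R}^m} E\left[\bar{\varphi}(x, \bar{Z}(x))\right] \nu(dx).$$

Let $\bar{\varphi}, \bar{\psi} : \mathbb{R}^m \times K \to \mathbb{C}$ be two measurable functions in $L^1_{\mathbb{C}}(\bar{\nu}) \cap L^2_{\mathbb{C}}(\bar{\nu})$. Then

$$\mathrm{cov}\left(\sum_{n \in \mathbb{N}} \bar{\varphi}(X_n, Z_n),\ \sum_{n \in \mathbb{N}} \bar{\psi}(X_n, Z_n)\right) = \int_{\mathbb{R}^m} E\left[\bar{\varphi}(x, \bar{Z}(x))\,\bar{\psi}(x, \bar{Z}(x))^{*}\right] \nu(dx).$$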

Corollary 10.3.14 Let $\bar{\varphi}$ be a non-negative function from $\mathbb{R}^m \times K$ to $\mathbb{R}$. Then,

$$E\left[e^{-\sum_{n \in \mathbb{N}} \bar{\varphi}(X_n, Z_n)}\right] = \exp\left\{\int_{\mathbb{R}^m} E\left[e^{-\bar{\varphi}(x, \bar{Z}(x))} - 1\right] \nu(dx)\right\}.$$


10.4 Operations on Poisson Processes


The framework and the results concerning marked Poisson processes are especially
convenient for studying the effects of various operations on Poisson processes, such as
thinning, coloring, transportation, translation and filtering.


Thinning and Coloring 


Thinning is the operation of randomly erasing points of a Poisson process. It is 
a particular case of the independent coloring operation whereby the points of a 
Poisson process are independently colored with the result of obtaining independent 
Poisson processes, each one corresponding to a different color. 


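Theorem 10.4.1 Consider the situation depicted in Theorem 10.3.11. Let $I$ be
an arbitrary index set and let $\{L_i\}_{i \in I}$ be a family of disjoint measurable sets of $K$.
Define for each $i \in I$ the simple point process $N_i$ on $\mathbb{R}^m$ by

$$N_i(C) = \sum_{n \in \mathbb{N}} 1_C(X_n)\,1_{L_i}(Z_n).$$

Then the family $N_i$ ($i \in I$) is an independent family of Poisson processes with
respective intensity measures $\nu_i$ ($i \in I$), where

$$\nu_i(dx) = Q(x, L_i)\,\nu(dx).$$

Proof. According to the definition of independence, it suffices to consider a finite
index set $J$. Define the simple point process $\bar{N}$ on $\mathbb{R}^m \times K$ as in (10.12). Then $\bar{N}$
is a Poisson process with intensity measure $\bar{\nu}(C \times L) = \int_C Q(x, L)\,\nu(dx)$. Defining
$\bar{\varphi}(x, z) = \sum_{i \in J} \varphi_i(x)\,1_{L_i}(z)$, we have $\sum_{i \in J} N_i(\varphi_i) = \bar{N}(\bar{\varphi})$. Therefore

$$E\left[e^{-\sum_{i \in J} N_i(\varphi_i)}\right] = E\left[e^{-\bar{N}(\bar{\varphi})}\right]$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\bar{\varphi}(x, z)} - 1\right) \bar{\nu}(dx \times dz)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\bar{\varphi}(x, z)} - 1\right) Q(x, dz)\,\nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\sum_{i \in J} \varphi_i(x) 1_{L_i}(z)} - 1\right) Q(x, dz)\,\nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \sum_{i \in J} \left(e^{-\varphi_i(x)} - 1\right) 1_{L_i}(z)\,Q(x, dz)\,\nu(dx)\right\}$$
$$= \exp\left\{\int_{\mathbb{R}^m} \sum_{i \in J} \left(e^{-\varphi_i(x)} - 1\right) Q(x, L_i)\,\nu(dx)\right\}$$
$$= \prod_{i \in J} \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_i(x)} - 1\right) Q(x, L_i)\,\nu(dx)\right\}.$$

Therefore,

$$E\left[e^{-\sum_{i \in J} N_i(\varphi_i)}\right] = \prod_{i \in J} \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_i(x)} - 1\right) \nu_i(dx)\right\},$$

and the result follows from Theorem 10.3.10.

The above theorem is indeed about thinning. For instance, the point process $N_1$
is obtained by thinning $N$, each point $x$ of which is saved with probability
$Q(x, L_1)$.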
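The independence asserted by Theorem 10.4.1 can be illustrated by simulation. The sketch below is a simplified instance chosen for the example: the base process is an HPP of intensity `lam` on $[0, 1]$, and the colors are IID and do not depend on the location (that is, $Q(x, L_i) = q_i$). The function names are hypothetical. The per-color counts should then have means $\lambda q_i$ and empirical covariances close to zero.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def colored_counts(lam, color_probabilities, rng):
    """Return the counts N_i([0, 1]) per color for an HPP of intensity lam with IID colors."""
    counts = [0] * len(color_probabilities)
    colors = list(range(len(color_probabilities)))
    for _ in range(sample_poisson(lam, rng)):
        color = rng.choices(colors, weights=color_probabilities)[0]
        counts[color] += 1
    return counts


def sample_covariance(xs, ys):
    """Empirical covariance of two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    rng = random.Random(6)
    samples = [colored_counts(6.0, [0.5, 0.3, 0.2], rng) for _ in range(5000)]
    first = [s[0] for s in samples]
    second = [s[1] for s in samples]
    print(sum(first) / 5000, sum(second) / 5000, sample_covariance(first, second))
```

A vanishing covariance is only a symptom of independence, not a proof of it; the theorem gives the full statement through the factorization of the Laplace functional.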

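EXAMPLE 10.4.2: ERASURES. Let in this special case $K = \{0, 1\}$ and

$$P(Z_n = 1 \mid X_n = x) = p(x).$$

We shall now define the point processes $N^{p}$ and $N^{\bar{p}}$ on $\mathbb{R}^m$, where $\bar{p} := 1 - p$, by

$$N^{p}(C) = \sum_{n \geq 1} Z_n 1_C(X_n) \quad \text{and} \quad N^{\bar{p}}(C) = \sum_{n \geq 1} (1 - Z_n) 1_C(X_n).$$

The interpretation is that $N^{\bar{p}}$ is obtained from $N$ by erasing points, a point of the
original point process $N$ located at $x$ being erased with probability $p(x)$ independently
of everything else.

By Theorem 10.4.1 (with $L_1 = \{1\}$ and $L_0 = \{0\}$), $N^{p}$ and $N^{\bar{p}}$ are independent Poisson
processes with respective intensity measures

$$\nu^{p}(dx) := p(x)\,\nu(dx) \quad \text{and} \quad \nu^{\bar{p}}(dx) := (1 - p(x))\,\nu(dx).$$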


Transportation 


This is the operation of moving the points of a Poisson process. 


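More precisely, consider the situation depicted in Theorem 10.3.11. Form a
point process $N^{*}$ on $K$ by associating to a point $X_n \in \mathbb{R}^m$ a point $Z_n \in K$:

$$N^{*}(L) = \sum_{n \in \mathbb{N}} 1_L(Z_n),$$

where $L \in \mathcal{K}$. We then say that $N^{*}$ is obtained by transporting $N$ via the
stochastic kernel $Q(x, \cdot)$.

Theorem 10.4.3 $N^{*}$ is a Poisson process on $K$ with intensity measure $\nu^{*}$ given
by

$$\nu^{*}(L) = \int_{\mathbb{R}^m} \nu(dx)\,Q(x, L).$$

Proof. Let $\varphi^{*} : K \to \mathbb{R}$ be a non-negative measurable function. We have

$$E\left[e^{-N^{*}(\varphi^{*})}\right] = E\left[e^{-\sum_{n \in \mathbb{N}} \varphi^{*}(Z_n)}\right]$$
$$= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\varphi^{*}(z)} - 1\right) \nu(dx)\,Q(x, dz)\right\}$$
$$= \exp\left\{\int_K \left(e^{-\varphi^{*}(z)} - 1\right) \int_{\mathbb{R}^m} \nu(dx)\,Q(x, dz)\right\} = \exp\left\{\int_K \left(e^{-\varphi^{*}(z)} - 1\right) \nu^{*}(dz)\right\}.$$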

EXAMPLE 10.4.4: TRANSLATION. Let $N$ be a Poisson process on $\mathbb{R}^m$ with
intensity measure $\nu$ and let $\{V_n\}_{n \in \mathbb{N}}$ be an IID sequence of random vectors of $\mathbb{R}^m$
with common distribution $Q$. Form the point process $N^{*}$ on $\mathbb{R}^m$ by translating
each point $X_n$ of $N$ by $V_n$. Formally,

$$N^{*}(C) = \sum_{n \in \mathbb{N}} 1_C(X_n + V_n).$$

We are in the situation of Theorem 10.4.3 with $Z_n = X_n + V_n$. In particular,
$Q(x, A) = Q(A - x)$. It follows that $N^{*}$ is a Poisson process on $\mathbb{R}^m$ with intensity
measure

$$\nu^{*}(L) = \int_{\mathbb{R}^m} Q(L - x)\,\nu(dx),$$

the convolution of $\nu$ and $Q$.


Poisson Shot Noise 


Let $N$ be a simple and locally finite point process on $\mathbb{R}^m$ with point sequence
$\{X_n\}_{n \in \mathbb{N}}$ and with marks $\{Z_n\}_{n \in \mathbb{N}}$ in the measurable space $(K, \mathcal{K})$. Let $h : \mathbb{R}^m \times K \to \mathbb{C}$
be a measurable function. The complex-valued spatial stochastic process
$\{X(y)\}_{y \in \mathbb{R}^m}$ given by

$$X(y) = \sum_{n \in \mathbb{N}} h(y - X_n, Z_n), \tag{10.14}$$

where the right-hand side is assumed well defined (for instance, when $h$ takes real
non-negative values), is called a spatial shot noise with random impulse response.
If $N$ is a simple and locally finite Poisson process on $\mathbb{R}^m$ with independent IID
marks $\{Z_n\}_{n \in \mathbb{N}}$, $\{X(y)\}_{y \in \mathbb{R}^m}$ is called a Poisson spatial shot noise with random
impulse response and independent IID marks.

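The following result is a direct application of Theorems 10.3.6 and 10.3.11.

Theorem 10.4.5 Consider the above Poisson spatial shot noise with random impulse
response and independent IID marks. Suppose that for all $y \in \mathbb{R}^m$,

$$\int_{\mathbb{R}^m} E\left[|h(y - x, Z_1)|\right] \nu(dx) < \infty$$

and

$$\int_{\mathbb{R}^m} E\left[|h(y - x, Z_1)|^2\right] \nu(dx) < \infty.$$

Then the complex-valued spatial stochastic process $\{X(y)\}_{y \in \mathbb{R}^m}$ given by (10.14) is
well defined, and for any $y, \xi \in \mathbb{R}^m$, we have

$$E[X(y)] = \int_{\mathbb{R}^m} E\left[h(y - x, Z_1)\right] \nu(dx)$$

and

$$\mathrm{cov}(X(y + \xi), X(y)) = \int_{\mathbb{R}^m} E\left[h(y + \xi - x, Z_1)\,h(y - x, Z_1)^{*}\right] \nu(dx).$$

In the case where the base point process $N$ is an HPP with intensity $\lambda$, we find
that

$$E[X(y)] = \lambda \int_{\mathbb{R}^m} E\left[h(x, Z_1)\right] dx$$

and

$$\mathrm{cov}(X(y + \xi), X(y)) = \lambda \int_{\mathbb{R}^m} E\left[h(x, Z_1)^{*}\,h(\xi + x, Z_1)\right] dx.$$

Observe that these quantities do not depend on $y \in \mathbb{R}^m$. The process $\{X(y)\}_{y \in \mathbb{R}^m}$
is therefore a wide-sense stationary process (see Chapter 12 for a definition).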
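The mean formula for an HPP-driven shot noise can be checked on a one-dimensional example. The sketch below rests on assumptions chosen for illustration only: $m = 1$, the impulse response is $h(x, z) = z \max(0, 1 - |x|)$ (a triangular pulse of random height), and the marks are IID uniform on $[0, 2]$. Since $h(\cdot, z)$ vanishes outside $[-1, 1]$, only the points of $N$ in $[y - 1, y + 1]$ contribute to $X(y)$, and $E[X(y)] = \lambda\,E[Z_1] \int_{-1}^{1} (1 - |x|)\,dx = \lambda$ for every $y$.

```python
import random


def sample_poisson(mean, rng):
    """Return a Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def shot_noise_at(y, lam, rng):
    """Sample X(y) = sum_n Z_n * max(0, 1 - |y - X_n|) for an HPP of intensity lam on R.

    Only the points in [y - 1, y + 1] contribute, so the HPP is sampled there only.
    """
    value = 0.0
    for _ in range(sample_poisson(2.0 * lam, rng)):
        x = y - 1.0 + 2.0 * rng.random()  # point location, uniform on the window
        z = 2.0 * rng.random()            # mark, uniform on [0, 2] (assumed law)
        value += z * max(0.0, 1.0 - abs(y - x))
    return value


def empirical_mean(y, lam, runs, rng):
    """Average of `runs` independent samples of X(y)."""
    return sum(shot_noise_at(y, lam, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    rng = random.Random(7)
    # Both estimates should be near lam = 5, whatever the location y.
    print(empirical_mean(0.0, 5.0, 5000, rng), empirical_mean(3.7, 5.0, 5000, rng))
```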


10.5 Exercises 


Exercise 10.5.1. BACKWARD AND FORWARD RECURRENCE TIMES
Let $\{T_n\}_{n \in \mathbb{Z}}$ be the sequence of event times of an HPP on $\mathbb{R}$ with the intensity $\lambda > 0$.
For fixed $t \in \mathbb{R}$, define the backward and forward recurrence times respectively by

$$B(t) = \inf\{t - T_n;\ T_n \leq t\},$$
$$F(t) = \inf\{T_n - t;\ T_n > t\}.$$

What is the distribution of the vector $(B(t), F(t))$? Compute $E[B(t) + F(t)]$.


Exercise 10.5.2. POISSON AND MULTINOMIAL
Let $N$ be a homogeneous Poisson process on $\mathbb{R}^m$ with intensity $\lambda$. Let $C_1, \ldots, C_K$
be disjoint bounded measurable sets of $\mathbb{R}^m$, and call $C$ their union. Let $n$ be
an integer. What is the conditional distribution of the vector $(N(C_1), \ldots, N(C_K))$
given that $N(C) = n$?

Exercise 10.5.3. POISSON UNDER THE LINE
A. Let $\bar{N}$ be an HPP on $\mathbb{R}^2$ with intensity 1. Let $\lambda : \mathbb{R} \to \mathbb{R}_+$ be a non-negative
locally integrable function. Define a point process $N$ on $\mathbb{R}$ as follows. The point
$t \in N$ if and only if there exists a $z \in \mathbb{R}$ such that $0 \leq z \leq \lambda(t)$ and $(t, z) \in \bar{N}$.
Prove that $N$ is a Poisson process on $\mathbb{R}$ with intensity function $\lambda(t)$.

B. Let $N$ be a Poisson process on $\mathbb{R}$ with intensity function $\lambda(t)$. Denoting by $T_n$
the $n$-th point of $N$ strictly to the right of the origin, prove that $T_n$ is an absolutely
continuous random variable and give its probability density. Give an expression
for the joint density of $(T_n, T_{n+1})$.


Exercise 10.5.4. POISSONIAN DISKS
Let $N$ be a homogeneous Poisson process on $\mathbb{R}^2$, of intensity $\lambda$. Draw around each
point $x \in N$ a closed disk of radius $a$. Let $X(y)$ be the number of disks covering
$y \in \mathbb{R}^2$.

1) Compute for $y \in \mathbb{R}^2$, $\theta \in \mathbb{R}_+$,

$$E\left[e^{-\theta X(y)}\right];$$

2) Deduce from this result the probability distribution of $X(y)$;

3) Give the average area inside the square $[0, T] \times [0, T]$ that is not covered by a
disk;

4) This area is delimited by a curve. Give its average length (excluding the parts
on the boundaries of $[0, T] \times [0, T]$).


Exercise 10.5.5. LINE OF SIGHT
Consider a Poisson process $N$ on $\mathbb{R}^2$ with diffuse and locally finite mean measure $\nu$. There
is a random shape centered around each of its points. Let the generic shape $S$
be distributed according to some probability distribution $Q_S$. Now consider two
arbitrary points $A$, $B$. We say that $A$ and $B$ can communicate if the line connecting
$A$ and $B$ does not intersect any of the existing shapes around the points of the
point process (for all $n \ge 1$, the “existing shape around” $X_n \in N$ is $X_n + S_n$, that
is $S_n$ translated by $X_n$, where $S_n$ is distributed according to $Q_S$). We assume that
$\{S_n\}_{n \ge 1}$ is an IID sequence independent of $N$. What is the probability that $A$ and
$B$ can successfully communicate? Keep the calculations as general as possible, and
then give the explicit result when $N$ is an HPP of intensity $\lambda$ and when the shape
is (1) a circle of fixed radius $a$; (2) a circle of random radius uniformly distributed
on $[0, 1]$.


Exercise 10.5.6. CELLPHONES 

Consider two independent Poisson processes $N_1$ and $N_2$ on $\mathbb{R}^m$ with respective
mean measures $\nu_1$ and $\nu_2$. Assume that $\nu_i(\mathbb{R}^m) < \infty$, $i = 1,2$. Compute the
average number of elements in $N_1$ that see no point of $N_2$ within distance $a$.


Exercise 10.5.7. MUTUALLY SINGULAR 

Let $N$ be a point process on $\mathbb{R}$ defined on a measurable space $(\Omega, \mathcal{F})$. Let $P_1$ and
$P_2$ be two probability measures on $(\Omega, \mathcal{F})$ that make of $N$ an HPP of intensity
$\lambda_1 > 0$ and $\lambda_2 > 0$ respectively.

Show that if $\lambda_1 \neq \lambda_2$, $P_1$ and $P_2$ are mutually singular, that is to say, that there
exists a set $A \in \mathcal{F}$ such that $P_1(A) = 1$ and $P_2(A) = 0$.


Exercise 10.5.8. COUPLED HPPS 
Let for $i = 1,2$, $\lambda_i : \mathbb{R}_+ \to \mathbb{R}$ be a non-negative measurable function, locally
integrable. Suppose that

$$\int_0^\infty |\lambda_1(t) - \lambda_2(t)| \, dt < \infty.$$


Show that one can construct on the same probability space $(\Omega, \mathcal{F}, P)$ two Poisson
processes $N_1$ and $N_2$ on $\mathbb{R}_+$, with respective intensity functions $\lambda_1(t)$ and $\lambda_2(t)$, with the
following coupling property:

There exists an almost surely finite random variable $\tau$ such that

$$P\big(N_1(C \cap [\tau,\infty)) = N_2(C \cap [\tau,\infty)) \text{ for all } C \in \mathcal{B}(\mathbb{R}_+)\big) = 1.$$

What is the probability distribution of the last point $Z$ of either $N_1$ or $N_2$ that is
not a shared point?


Exercise 10.5.9. GAUSSIAN LIMIT OF A SHOT NOISE
Consider the shot noise process $\{X(t)\}_{t \in \mathbb{R}}$ given by

$$X(t) = \sum_{n \in \mathbb{Z}} h(t - T_n),$$

where $\{T_n\}_{n \in \mathbb{Z}}$ is an HPP on $\mathbb{R}$ with intensity $\lambda = n\lambda_0$ and $h(t) = \frac{1}{\sqrt{n}} h_0(t)$ for
some integrable function $h_0(t)$ such that $\int_{\mathbb{R}} h_0(t)\,dt = 0$. Show that the finite-dimensional
distributions of $\{X(t)\}_{t \in \mathbb{R}}$ converge as $n \to \infty$ to the finite-dimensional distributions of a
centered Gaussian process $\{Y(t)\}_{t \in \mathbb{R}}$ with covariance function

$$E[Y(t+\tau)Y(t)] = \lambda_0 \int_{\mathbb{R}} h_0(s+\tau) h_0(s)\,ds.$$


Exercise 10.5.10. PROVE THEOREM 10.2.22 
Prove Theorem 10.2.22. 


Exercise 10.5.11. DROPPING HUMANITARIAN PARCELS 

Parcels are dropped on the plane $\mathbb{R}^2$. The impact times $\{T_n\}_{n \in \mathbb{Z}}$ form a simple
Poisson process of mean measure $\nu$, the impact locations $\{Z_n\}_{n \in \mathbb{Z}}$ are IID and
independent of the impact times, and their common probability distribution is $Q$.
A “shape” moves on $\mathbb{R}^2$ in order to collect the parcels as they impact on it. More
precisely, there is for each time $t \in \mathbb{R}$ a measurable subset $S(t) \in \mathcal{B}(\mathbb{R}^2)$ and the
point process $N$ counting the parcels falling on the shape is defined by

$$N(C) = \sum_{n \in \mathbb{Z}} 1_C(T_n)\, 1_{S(T_n)}(Z_n).$$

Prove that $N$ is a Poisson process and give its mean measure.


Exercise 10.5.12. POISSONIAN DISC CLUTTERS
Let $N$ be a homogeneous Poisson process on $\mathbb{R}^2$, of intensity $\lambda$. Draw around each
point $x \in N$ a closed disk of radius $a$. Let $X(y)$ be the number of disks covering
$y \in \mathbb{R}^2$.

1) Compute for $y \in \mathbb{R}^2$, $\theta \in \mathbb{R}_+$,

$$E\left[e^{-\theta X(y)}\right];$$

2) Deduce from this result the probability distribution of $X(y)$;

3) Give the average surface inside the square $[0,T] \times [0,T]$ that is not covered by
a disk;

4) This surface is delimited by a curve. Give its average length (excluding the
parts on the boundaries of $[0,T] \times [0,T]$).


Exercise 10.5.13. RANDOM POINTS UNIFORMLY DISTRIBUTED ON [0, 1]
Construct a point process $N$ on $\mathbb{R}$ in the following way. First draw a finite integer-valued
random variable $T$, and then an IID sequence $\{U_n\}_{n \ge 1}$ uniformly distributed
on $[0,1]$, independent of $T$. Define $a_k := P(T = k)$ $(k \ge 0)$. Finally, let
$N = \sum_{n=1}^{T} \varepsilon_{U_n}$, where $\varepsilon_a$ is the Dirac measure at $a$, and where $\sum_{n=1}^{0} \varepsilon_{U_n}$ is the null
measure by convention. What is the Laplace functional of $N$? What about the
case where $T$ is a Poisson variable of mean $\theta$?


Exercise 10.5.14. LAPLACE FUNCTIONAL OF A CONTRACTED POINT PROCESS 
Let $N$ be a simple point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n \in \mathbb{N}}$ and let
$a > 0$. Define the “contracted”³ point process $N_a$ by its sequence of
points $\{aX_n\}_{n \in \mathbb{N}}$. Prove that its Laplace functional is

$$L_{N_a}(\varphi) = E\left[\exp\left(-\sum_{n \in \mathbb{N}} \varphi(aX_n)\right)\right] = L_N(\varphi(a\,\cdot)).$$


Exercise 10.5.15. DISTRIBUTION OF THE MAXIMUM INTERFERENCE 

Let $N$ be a homogeneous Poisson process on $\mathbb{R}^m$ of positive intensity $\lambda$ and with
point sequence $\{X_n\}_{n \ge 1}$. Let $\{Z_n\}_{n \ge 1}$ be an IID sequence of real non-negative
random variables with common distribution $Q$, and independent of $N$. Compute
the distribution of the random variable

$$\max_{n \ge 1} Z_n e^{-\beta \|X_n\|} \qquad (\beta > 0).$$

(The title of the exercise refers to mobile communications: $Z_n$ is the noise intensity
generated at point $X_n$, and $e^{-\beta \|X_n\|}$ is an attenuation factor for a receiver located
at 0.)


Exercise 10.5.16. THE M/GI/∞ MODEL, TAKE 2
In Example 10.3.12,

(i) prove that the departure process is a homogeneous Poisson process with
intensity $\lambda$,

(ii) compute $\operatorname{cov}(X(t), X(t+\tau))$ for all $t, \tau \in \mathbb{R}$, $\tau \ge 0$, and

(iii) interpret the process $\{X(t)\}_{t \in \mathbb{R}}$ as a shot noise in order to obtain the results
of Example 10.3.12, and of (i) and (ii), from the general results of Section
10.3 (subsection Marked Spatial Poisson Processes, page 403).


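³ Of course, if $a > 1$, it is in fact dilated...


Exercise 10.5.17. LIFTING

Let $N$ be a Poisson process on $\mathbb{R}$ with (locally integrable) intensity function
$\lambda : \mathbb{R} \to \mathbb{R}_+$. Let $\{T_n\}_{n \in \mathbb{Z}}$ be its sequence of points, and let $\{U_n\}_{n \in \mathbb{Z}}$ be an IID
sequence of random variables uniformly distributed on $[0,1]$. Let $\widehat{N}$ be an HPP
on $\mathbb{R} \times \mathbb{R}_+$ with intensity 1, independent of $N$ and of $\{U_n\}_{n \in \mathbb{Z}}$. Define a point
process $\overline{N}$ on $\mathbb{R} \times \mathbb{R}_+$ by

$$\overline{N}(C) = \sum_{n \in \mathbb{Z}} 1_C\big((T_n, U_n \lambda(T_n))\big) + \widehat{N}(C \cap \overline{H}),$$

where $\overline{H}$ is the complement in $\mathbb{R} \times \mathbb{R}_+$ of

$$H := \{(t,z) \in \mathbb{R} \times \mathbb{R}_+ \,;\ 0 \le z \le \lambda(t)\}.$$

Show that $\overline{N}$ is an HPP on $\mathbb{R} \times \mathbb{R}_+$ with intensity 1.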


Exercise 10.5.18. WATER BOMBS 

You are initially located at the origin (0,0) of the plane at which is centered a disk 
D of radius R. You run in a straight line from the origin to the “shelter point” 
(0, R) at constant speed v. The reason why you are running is that water bombs 
are being dropped on the disk $D$. The times of impact form an HPP of intensity
$\lambda$, and each impact is located independently of all the rest, uniformly on the disk.
You will get wet if the impact of the bomb is within distance a of your position at 
the time of impact. Once arrived at the shelter point (0, R), the bombing stops. 
What are your chances of not getting wet? Given that you did get wet, what is 
the expected time that you remained dry? 


Exercise 10.5.19. SMOKING POT AT SAINT MARY-JANE’S 

Smoking pot was recently banned on the Saint Mary-Jane’s college campus. The 
authorities noticed that the violators of the ban make use of a restroom in a 
secluded wing of the campus. They consequently devised a strategy to send “cops” 
to capture the culprits. Assume that the schoolboys’ arrival times in the restroom
premises form a Poisson process with independent IID marks. Let $T_n$ denote the
$n$-th arrival time of a schoolboy in the pot sanctuary (the restrooms) and let $\sigma_n$ be
the time he spends smoking. Cops also form a Poisson process with independent
IID marks. Denote the $k$-th arrival time of a cop on the potential crime scene
by $\widetilde{T}_k$ and by $S_k$ the lingering time there of the corresponding representative of
the college authority. The probability distribution of $\sigma$ is $Q_\sigma$, and that of $S$ is
$Q_S$. Assuming the point processes of students and of the cops to be HPPs with
respective intensities $\lambda_\sigma > 0$ and $\lambda_S > 0$, compute the average number of students
caught per unit of time.


Exercise 10.5.20. LAPLACE FUNCTIONAL OF A HOMOGENEOUS COX PROCESS
Let $N$ be a Cox process on $\mathbb{R}^m$ with constant intensity process, that is, $\nu(dx) := \Lambda\, \ell^m(dx)$,
where $\ell^m$ is the Lebesgue measure on $\mathbb{R}^m$ and $\Lambda$ is a non-negative random
variable with Laplace transform $L_\Lambda(t) := E[e^{-t\Lambda}]$. Show that its Laplace
functional is

$$L_N(\varphi) = L_\Lambda\left(\int_{\mathbb{R}^m} \left(1 - e^{-\varphi(x)}\right) dx\right).$$


Exercise 10.5.21. THE MAXIMUM FORMULA 
Give a direct proof of the result of Example 10.3.8 based on the construction of 
Section 10.3. 


Exercise 10.5.22. LIKELIHOOD RATIO 

Let $N$ be an HPP on $\mathbb{R}$ of intensity 1, and let $\Lambda$ be a non-negative random variable,
both defined on the same probability space $(\Omega, \mathcal{F}, P)$. Let $T > 0$ be a fixed real
number. Define

$$L(T) = \Lambda^{N(T)} \exp(-(\Lambda - 1)T).$$

(1) Show that $E[L(T)] = 1$.

(2) Define a probability $Q$ on $(\Omega, \mathcal{F})$ by $Q(A) = E[L(T) 1_A]$. Show that under $Q$,
$N$ restricted to the interval $[0,T]$ is a Cox process with intensity $\lambda(t) = \Lambda$.

(3) Show that for $t \in [0,T]$,

$$E_Q\left[\Lambda \mid \mathcal{F}_t^N\right] = \frac{\psi(N(t)+1, t)}{\psi(N(t), t)},$$

where

$$\psi(n,t) = \int_{\mathbb{R}_+} \lambda^n e^{-\lambda t}\, F(d\lambda)$$

and $F$ is the CDF of $\Lambda$.




Chapter 11 


Brownian Motion 


Brownian motion owes its name to the botanist Robert Brown who observed the 
chaotic motion of pollen grains in a liquid. From the mathematical point of view, 
it received attention from Albert Einstein and Louis Bachelier. The latter was 
motivated by his interest in finance, finding that the model could serve to describe 
the fluctuations of the stock market, and nowadays, its role in mathematical fi- 
nance is well established. Brownian motion is also called the Wiener process, after 
Norbert Wiener, who introduced it in the theory of stochastic systems driven by 
white noise, a notion that we shall discuss in the next chapter. 


11.1  Continuous-time Stochastic Processes 


Some generalities on continuous-time stochastic processes are necessary before
addressing the central topic of this chapter.


Definition 11.1.1 A stochastic process (or random process) is a family $\{X(t)\}_{t \in \mathbf{T}}$
of random variables taking their values in some measurable space $(E, \mathcal{E})$ and defined
on the same probability space $(\Omega, \mathcal{F}, P)$.


(The spaces $E$ of interest in this chapter are $\mathbb{R}^m$ $(m \ge 1)$, $\mathbb{C}$, $\mathbb{Z}$ and $\mathbb{N}$.)


It is called a real (resp., complex) stochastic process if it takes real (resp.,
complex) values, a continuous-time stochastic process when the index set $\mathbf{T}$ is $\mathbb{R}$
or $\mathbb{R}_+$, and a discrete-time stochastic process when it is $\mathbb{N}$ or $\mathbb{Z}$. When the index
set is $\mathbb{N}$ or $\mathbb{Z}$, we also use the notation $n$ instead of $t$ for the time index, and write
$X_n$ instead of $X(t)$.


For each $\omega \in \Omega$, the function $t \mapsto X(t,\omega)$ is called a trajectory (more precisely,
the $\omega$-trajectory). This is why a stochastic process is sometimes called a random
function.


EXAMPLE 11.1.2: RANDOM SINUSOID. Let $A$ be some real non-negative random
variable, let $\nu_0 \in \mathbb{R}$ be a positive constant and let $\Phi$ be a random variable with
values in $[0, 2\pi]$. The formula

$$X(t) = A \sin(2\pi\nu_0 t + \Phi)$$

defines a stochastic process. For each sample $\omega \in \Omega$, the function $t \mapsto X(t,\omega)$ is a
sinusoid with frequency $\nu_0$, random amplitude $A(\omega)$ and random phase $\Phi(\omega)$.
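
To make the notion of trajectory concrete, the following sketch draws one sample $\omega$, that is one pair $(A(\omega), \Phi(\omega))$, and returns the resulting deterministic function of $t$. The laws chosen for $A$ (exponential) and $\Phi$ (uniform), as well as the function name `sample_random_sinusoid`, are assumptions made for illustration only; the example itself fixes neither law.

```python
import math
import random


def sample_random_sinusoid(nu0, rng):
    """Draw one sample (A, Phi) and return it with the trajectory t -> X(t)."""
    # Illustrative assumption: A is exponential with mean 1 and
    # Phi is uniform on [0, 2*pi]. The text does not specify these laws.
    amplitude = rng.expovariate(1.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)

    def trajectory(t):
        return amplitude * math.sin(2.0 * math.pi * nu0 * t + phase)

    return amplitude, phase, trajectory


rng = random.Random(0)
amplitude, phase, x = sample_random_sinusoid(nu0=2.0, rng=rng)
# One trajectory sampled on a grid of [0, 1): a sinusoid of period 1 / nu0.
values = [x(k / 100.0) for k in range(100)]
```

Once the sample is drawn, nothing random remains: `x` is an ordinary sinusoid bounded by `amplitude`.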


One way of describing the probabilistic behavior of a stochastic process is by 
means of its finite-dimensional distribution. 


Definition 11.1.3 The finite-dimensional (fidi) distribution of a stochastic process
$\{X(t)\}_{t \in \mathbf{T}}$ is the collection of probability distributions of the random vectors

$$(X(t_1), \ldots, X(t_k)) \qquad (k \ge 1;\ t_1, \ldots, t_k \in \mathbf{T}).$$


Definition 11.1.4 A stochastic process $\{X(t)\}_{t \in \mathbb{R}}$ is said to be stationary iff for
all $k \ge 1$ and all $t_1, \ldots, t_k \in \mathbb{R}$ the probability distribution of the random vector

$$(X(t_1 + \tau), \ldots, X(t_k + \tau))$$

is independent of $\tau$.


Definition 11.1.5 A complex stochastic process $\{X(t)\}_{t \in \mathbb{R}}$ is said to have independent
increments if for all $n \ge 2$ and for all mutually disjoint intervals
$(a_1, b_1], \ldots, (a_n, b_n]$ of $\mathbb{R}$, the random variables

$$X(b_1) - X(a_1), \ldots, X(b_n) - X(a_n)$$

are independent.


It is sometimes useful to view a stochastic process as a mapping $X : \mathbf{T} \times \Omega \to E$,
defined by $(t,\omega) \mapsto X(t,\omega)$.


Definition 11.1.6 The stochastic process $\{X(t)\}_{t \in \mathbb{R}}$ is said to be measurable iff
the mapping from $\mathbb{R} \times \Omega$ into $E$ defined by $(t,\omega) \mapsto X(t,\omega)$ is measurable with
respect to $\mathcal{B}(\mathbb{R}) \otimes \mathcal{F}$ and $\mathcal{E}$.


In particular, by the Fubini–Tonelli theorem (Theorem 4.4.7), for any $\omega \in \Omega$
the mapping $t \mapsto X(t,\omega)$ is measurable with respect to the $\sigma$-fields $\mathcal{B}(\mathbb{R})$ and $\mathcal{E}$.
Also, if $E = \mathbb{R}$ and if $X(t)$ is non-negative, one can define the Lebesgue integral

$$\int_{\mathbb{R}} X(t,\omega)\,dt$$

for each $\omega \in \Omega$, and also apply Tonelli’s theorem to obtain

$$E\left[\int_{\mathbb{R}} X(t)\,dt\right] = \int_{\mathbb{R}} E[X(t)]\,dt.$$

By Fubini’s theorem, the last equality also holds true for measurable stochastic
processes of arbitrary sign such that $\int_{\mathbb{R}} E[|X(t)|]\,dt < \infty$.


The next theorem tells us that the stochastic processes occurring in applications 
are measurable. 


Theorem 11.1.7 A right-continuous (resp., left-continuous) stochastic process
$\{X(t)\}_{t \in \mathbb{R}}$ taking its values in $\mathbb{R}^m$ is measurable.


Proof. For all $n \ge 0$ and all $t \in \mathbb{R}$, let

$$X_n(t) := \sum_{k=-n(2^n-1)}^{n(2^n-1)} X\big((k+1)2^{-n}\big)\, 1_{[k2^{-n},\,(k+1)2^{-n})}(t).$$

The stochastic process $\{X_n(t)\}_{t \in \mathbb{R}}$ is measurable. If $t \mapsto X(t,\omega)$ is right-continuous,
$X(t,\omega)$ is the limit of $X_n(t,\omega)$ for all $(t,\omega) \in \mathbb{R} \times \Omega$, and therefore $(t,\omega) \mapsto X(t,\omega)$
is measurable. The case of a left-continuous process is treated in a similar manner.
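
The dyadic approximation used in this proof can be tried on a single path. The sketch below is only an illustration under simplifying assumptions: it ignores the truncation of the sum (it picks the dyadic cell containing $t$ directly), and the two step functions are hypothetical test paths. It shows that evaluating the path at the right endpoint of each cell recovers a right-continuous path at its jump, while the same scheme misses the value of a left-continuous path there, which is why the left-continuous case needs the symmetric construction.

```python
import math


def dyadic_approximation(path, n):
    """Return t -> path((k+1)/2**n) for t in the cell [k/2**n, (k+1)/2**n)."""
    scale = 2 ** n

    def approx(t):
        k = math.floor(t * scale)      # index of the dyadic cell containing t
        return path((k + 1) / scale)   # value at the right endpoint of the cell

    return approx


def right_continuous_step(t):
    return 1.0 if t >= 0.3 else 0.0    # jump at 0.3, right-continuous there


def left_continuous_step(t):
    return 1.0 if t > 0.3 else 0.0     # same jump, left-continuous there


approx_right = dyadic_approximation(right_continuous_step, 10)
approx_left = dyadic_approximation(left_continuous_step, 10)
```

Here `approx_right(0.3)` agrees with `right_continuous_step(0.3)`, whereas `approx_left(0.3)` does not agree with `left_continuous_step(0.3)`.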


Second-order Stochastic Processes 


Definition 11.1.8 A complex stochastic process $\{X(t)\}_{t \in \mathbf{T}}$ satisfying the condition

$$E[|X(t)|^2] < \infty \qquad (t \in \mathbf{T})$$

is called a second-order stochastic process.


In particular, the mean function $m : \mathbf{T} \to \mathbb{C}$ and the covariance function
$\Gamma : \mathbf{T} \times \mathbf{T} \to \mathbb{C}$ are well defined by

$$m(t) := E[X(t)]$$

and

$$\Gamma(t,s) := \operatorname{cov}(X(t), X(s)) = E[X(t)X(s)^*] - m(t)m(s)^*.$$


When the mean function is the null function, the stochastic process is said to be 
centered. 


Theorem 11.1.9 Let $\{X(t)\}_{t \in \mathbf{T}}$ be a second-order complex stochastic process with
mean function $m$ and covariance function $\Gamma$. Then, for all $s,t \in \mathbf{T}$,

$$E[|X(t) - m(t)|] \le \Gamma(t,t)^{1/2}$$

and

$$|\Gamma(t,s)| \le \Gamma(t,t)^{1/2}\, \Gamma(s,s)^{1/2}.$$

Proof. Apply Schwarz’s inequality

$$E[|X||Y|] \le E[|X|^2]^{1/2}\, E[|Y|^2]^{1/2}$$

with $X := X(t) - m(t)$ and $Y := 1$ for the first inequality, and with $X :=
X(t) - m(t)$ and $Y := X(s) - m(s)$ for the second one.


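Theorem 11.1.10 Let $\{X(t)\}_{t \in \mathbb{R}}$ be a second-order complex-valued measurable
stochastic process with mean function $m$ and covariance function $\Gamma$. Let $f : \mathbb{R} \to \mathbb{C}$
be a measurable function such that

$$\int_{\mathbb{R}} |f(t)|\, E[|X(t)|]\,dt < \infty. \tag{11.1}$$

Then the integral $\int_{\mathbb{R}} f(t)X(t)\,dt$ is almost surely well defined and

$$E\left[\int_{\mathbb{R}} f(t)X(t)\,dt\right] = \int_{\mathbb{R}} f(t)m(t)\,dt.$$

Suppose in addition that $f$ satisfies the condition

$$\int_{\mathbb{R}} |f(t)|\, \Gamma(t,t)^{1/2}\,dt < \infty \tag{11.2}$$

and let $g : \mathbb{R} \to \mathbb{C}$ be a function with the same properties as $f$. Then $\int_{\mathbb{R}} f(t)X(t)\,dt$
is square-integrable and

$$\operatorname{cov}\left(\int_{\mathbb{R}} f(t)X(t)\,dt,\ \int_{\mathbb{R}} g(s)X(s)\,ds\right) = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g^*(s)\Gamma(t,s)\,dt\,ds.$$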


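Remark: For a centered process, since $E[|X(t)|] \le E[1 + |X(t)|^2] = 1 + \Gamma(t,t)$, condition (11.1) is
satisfied as soon as $f$ is an integrable function such that $\int_{\mathbb{R}} |f(t)|\Gamma(t,t)\,dt < \infty$.


Proof. By Tonelli’s theorem

$$E\left[\int_{\mathbb{R}} |f(t)||X(t)|\,dt\right] = \int_{\mathbb{R}} |f(t)|\, E[|X(t)|]\,dt < \infty$$

and therefore $\int_{\mathbb{R}} |f(t)||X(t)|\,dt < \infty$ almost surely, so that almost surely the integral
$\int_{\mathbb{R}} f(t)X(t)\,dt$ is well defined and finite. Also (Fubini)

$$E\left[\int_{\mathbb{R}} f(t)X(t)\,dt\right] = \int_{\mathbb{R}} E[f(t)X(t)]\,dt = \int_{\mathbb{R}} f(t)E[X(t)]\,dt.$$

Suppose now (without loss of generality) that the process is centered. By Tonelli’s
theorem

$$E\left[\left(\int_{\mathbb{R}} |f(t)||X(t)|\,dt\right)\left(\int_{\mathbb{R}} |g(s)||X(s)|\,ds\right)\right]
= \int_{\mathbb{R}}\int_{\mathbb{R}} |f(t)||g(s)|\, E[|X(t)||X(s)|]\,dt\,ds.$$

But (Schwarz’s inequality) $E[|X(t)||X(s)|] \le |\Gamma(t,t)|^{1/2} |\Gamma(s,s)|^{1/2}$, and therefore the
right-hand side of the last equality is bounded by

$$\left(\int_{\mathbb{R}} |f(t)||\Gamma(t,t)|^{1/2}\,dt\right)\left(\int_{\mathbb{R}} |g(s)||\Gamma(s,s)|^{1/2}\,ds\right) < \infty.$$

One may therefore apply Fubini’s theorem to obtain

$$E\left[\left(\int_{\mathbb{R}} f(t)X(t)\,dt\right)\left(\int_{\mathbb{R}} g(s)X(s)\,ds\right)^*\right] = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g(s)^*\, E[X(t)X(s)^*]\,dt\,ds.$$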


Obviously, for a stationary second-order complex stochastic process $\{X(t)\}_{t \in \mathbb{R}}$,
for all $s,t \in \mathbb{R}$,

$$m(t) = m, \tag{11.3}$$

where $m \in \mathbb{C}$ and

$$\Gamma(t,s) = C(t-s) \tag{11.4}$$

for some function $C : \mathbb{R} \to \mathbb{C}$, also called the covariance function of the process.
The complex number $m$ is called the mean of the process.


Definition 11.1.11 A second-order stochastic process $\{X(t)\}_{t \in \mathbb{R}}$ is said to have
orthogonal increments if for all $n \ge 2$ and for all mutually disjoint intervals
$(a_1, b_1], \ldots, (a_n, b_n]$ of $\mathbb{R}$, the random variables

$$X(b_1) - X(a_1), \ldots, X(b_n) - X(a_n)$$

are mutually orthogonal.


Clearly, a centered second-order stochastic process with independent increments
has a fortiori orthogonal increments.


Wide-sense Stationarity 

Let $\mathbf{T} = \mathbb{R}$, $\mathbb{R}_+$, $\mathbb{Z}$ or $\mathbb{N}$, and let $\{X(t)\}_{t \in \mathbf{T}}$ be a second-order stochastic process.


Definition 11.1.12 If conditions (11.3) and (11.4) are satisfied for all $s,t \in \mathbf{T}$,
the complex second-order stochastic process $\{X(t)\}_{t \in \mathbf{T}}$ is called wide-sense stationary.
In continuous time ($\mathbf{T} = \mathbb{R}$ or $\mathbb{R}_+$) this appellation is reserved for wide-sense
stationary processes that have in addition a continuous covariance function.


There exist stochastic processes that are wide-sense stationary but not strictly 
stationary (Exercise 11.6.1). 


Note that $C(0) = \sigma_X^2$, the variance of any of the random variables $X(t)$.


As an immediate corollary of Theorem 11.1.9, we have: 


Corollary 11.1.13 Let $\{X(t)\}_{t \in \mathbf{T}}$ be a wide-sense stationary stochastic process
with mean $m$ and covariance function $C$. Then

$$E[|X(t) - m|] \le C(0)^{1/2}$$

and

$$|C(\tau)| \le C(0).$$


Recall the definition of the correlation coefficient $\rho$ between two non-trivial real
square-integrable random variables $X$ and $Y$ with respective means $m_X$ and $m_Y$
and respective variances $\sigma_X^2$ and $\sigma_Y^2$:

$$\rho := \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}.$$

The variable $aX + b$ that minimizes the function $F(a,b) := E[(Y - aX - b)^2]$ is

$$\widehat{Y} = m_Y + \frac{\operatorname{cov}(X,Y)}{\sigma_X^2}(X - m_X),$$

and moreover

$$E\left[(\widehat{Y} - Y)^2\right] = (1 - \rho^2)\,\sigma_Y^2$$

(Theorem 3.3.9). This random variable is called the best linear-quadratic estimate
of $Y$ given $X$, or the linear regression of $Y$ on $X$.


For a WSS stochastic process with covariance function $C$, the function

$$\rho(\tau) := \frac{C(\tau)}{C(0)}$$

is called the autocorrelation function. It is in fact, for any $t$, the correlation coefficient
between $X(t)$ and $X(t+\tau)$. In particular, the best linear-quadratic estimate
of $X(t+\tau)$ given $X(t)$ is

$$\widehat{X}(t+\tau \mid t) = m + \rho(\tau)(X(t) - m).$$

The estimation error is then, according to the above,

$$E\left[\left(\widehat{X}(t+\tau \mid t) - X(t+\tau)\right)^2\right] = \sigma_X^2\left(1 - \rho(\tau)^2\right).$$


In the continuous time case, this shows that if the support of the covariance 
function is concentrated around 7 = 0, the process tends to be “unpredictable” . 
We shall come back to this when we discuss the notion of white noise. 
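
The effect of a covariance concentrated near $\tau = 0$ on the estimation error can be seen numerically. The sketch below evaluates the best linear estimate $m + \rho(\tau)(X(t) - m)$ and its mean-square error $C(0)(1 - \rho(\tau)^2)$ for a covariance function supplied by the caller. The exponential covariance `exponential_cov` and its parameters are assumptions chosen for illustration, not a model taken from the text.

```python
import math


def autocorrelation(cov, tau):
    """rho(tau) = C(tau) / C(0) for a WSS covariance function cov."""
    return cov(tau) / cov(0.0)


def best_linear_estimate(x_t, mean, cov, tau):
    """Best linear estimate of X(t + tau) given the observed value X(t) = x_t."""
    return mean + autocorrelation(cov, tau) * (x_t - mean)


def prediction_error(cov, tau):
    """Mean-square error C(0) * (1 - rho(tau)**2) of that estimate."""
    rho = autocorrelation(cov, tau)
    return cov(0.0) * (1.0 - rho * rho)


def exponential_cov(tau, variance=2.0, rate=3.0):
    """Hypothetical covariance C(tau) = variance * exp(-rate * |tau|)."""
    return variance * math.exp(-rate * abs(tau))


errors = [prediction_error(exponential_cov, tau) for tau in (0.0, 0.1, 0.5, 1.0, 10.0)]
```

The error starts at 0 for $\tau = 0$ and increases towards the variance $C(0)$: far from $\tau = 0$ the observation $X(t)$ carries almost no information, and the estimate reduces to the mean $m$.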


EXAMPLE 11.1.14: HARMONIC PROCESS. Let $\{U_k\}_{k \ge 1}$ be square-integrable
centered random variables that are mutually uncorrelated. Let $\{\Phi_k\}_{k \ge 1}$ be completely
random phases, that is, real random variables uniformly distributed on
$[0, 2\pi]$. Suppose moreover that the $U$ variables are independent of the $\Phi$ variables.
Finally, suppose that $\sum_{k=1}^{\infty} E[|U_k|^2] < \infty$. For all $t \in \mathbb{R}$, the series on the
right-hand side of

$$X(t) = \sum_{k=1}^{\infty} U_k \cos(2\pi\nu_k t + \Phi_k),$$

where the $\nu_k$’s are arbitrary real numbers (frequencies), is convergent in the quadratic
mean and defines a centered WSS stochastic process with covariance function

$$C(\tau) = \sum_{k=1}^{\infty} \frac{1}{2} E[|U_k|^2] \cos(2\pi\nu_k \tau).$$

(This stochastic process is called a harmonic process.)


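We first do the proof for a finite number $N$ of terms, that is with $X(t) =
\sum_{k=1}^{N} U_k \cos(2\pi\nu_k t + \Phi_k)$. We then have

$$E[X(t)] = E\left[\sum_{k=1}^{N} U_k \cos(2\pi\nu_k t + \Phi_k)\right]
= \sum_{k=1}^{N} E[U_k \cos(2\pi\nu_k t + \Phi_k)] = \sum_{k=1}^{N} E[U_k]\, E[\cos(2\pi\nu_k t + \Phi_k)] = 0$$

and

$$\begin{aligned}
E[X(t+\tau)X(t)] &= E\left[\sum_{k=1}^{N}\sum_{\ell=1}^{N} U_k U_\ell \cos(2\pi\nu_k(t+\tau) + \Phi_k)\cos(2\pi\nu_\ell t + \Phi_\ell)\right] \\
&= \sum_{k=1}^{N}\sum_{\ell=1}^{N} E[U_k U_\ell \cos(2\pi\nu_k(t+\tau) + \Phi_k)\cos(2\pi\nu_\ell t + \Phi_\ell)] \\
&= \sum_{k=1}^{N}\sum_{\ell=1}^{N} E[U_k U_\ell]\, E[\cos(2\pi\nu_k(t+\tau) + \Phi_k)\cos(2\pi\nu_\ell t + \Phi_\ell)] \\
&= \sum_{k=1}^{N} E[|U_k|^2]\, E[\cos(2\pi\nu_k(t+\tau) + \Phi_k)\cos(2\pi\nu_k t + \Phi_k)] \\
&= \sum_{k=1}^{N} E[|U_k|^2]\, E\left[\tfrac{1}{2}\big(\cos(2\pi\nu_k(2t+\tau) + 2\Phi_k) + \cos(2\pi\nu_k\tau)\big)\right].
\end{aligned}$$

The announced result then follows since

$$E[\cos(2\pi\nu_k(2t+\tau) + 2\Phi_k)] = \frac{1}{2\pi}\int_0^{2\pi} \cos(2\pi\nu_k(2t+\tau) + 2\varphi)\,d\varphi = 0.$$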

The extension of this result to an infinite sum of complex exponentials is a straight- 
forward consequence of the result of Example 6.4.8. 
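
The key step of the finite-sum computation is that the average over a uniform phase of $\cos(2\pi\nu(t+\tau) + \varphi)\cos(2\pi\nu t + \varphi)$ equals $\frac12\cos(2\pi\nu\tau)$, whatever $t$. The following sketch checks this by averaging over an equispaced grid of phases (exact for trigonometric polynomials of this degree, up to rounding) and evaluates the resulting covariance for a finite harmonic process. The function names and the numerical values are illustrative assumptions.

```python
import math


def phase_average(nu, t, tau, grid=64):
    """Average of cos(2*pi*nu*(t+tau) + phi) * cos(2*pi*nu*t + phi)
    over phi on an equispaced grid of [0, 2*pi)."""
    total = 0.0
    for j in range(grid):
        phi = 2.0 * math.pi * j / grid
        total += (math.cos(2.0 * math.pi * nu * (t + tau) + phi)
                  * math.cos(2.0 * math.pi * nu * t + phi))
    return total / grid


def harmonic_covariance(second_moments, frequencies, tau):
    """C(tau) = sum_k (1/2) E[|U_k|^2] cos(2*pi*nu_k*tau) for finitely many terms."""
    return sum(0.5 * p * math.cos(2.0 * math.pi * nu * tau)
               for p, nu in zip(second_moments, frequencies))


# The phase average does not depend on t: this is what makes the process WSS.
averages = [phase_average(nu=1.3, t=t, tau=0.2) for t in (0.0, 0.7, 3.1)]
variance = harmonic_covariance([1.0, 0.25], [1.3, 2.0], tau=0.0)
```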


11.2 Gaussian Processes 


Brownian motion is a particular type of Gaussian process, which we now introduce. 


Gaussian processes are important for at least three reasons: 


(1) because of their mathematical tractability due in particular to the stability
of the Gaussianity of stochastic processes: (a) by linear transformations
(Theorem 3.4.5) and (b) by limits in the quadratic mean (see Theorem 7.4.5),

(2) because of their ubiquity due to the many forms of the central limit theorem
(Theorem 7.2.1), and

(3) because the most important Gaussian process, Brownian motion, plays a
fundamental role in the noise theory in communications and in mathematical
finance.


Let $\mathbf{T}$ be an arbitrary index set.


Definition 11.2.1 The real-valued stochastic process $\{X(t)\}_{t \in \mathbf{T}}$ is called a Gaussian
process if for all $n \ge 1$ and for all $t_1, \ldots, t_n \in \mathbf{T}$, the random vector
$(X(t_1), \ldots, X(t_n))$ is Gaussian.


In particular, its characteristic function is given by the formula

$$E\left[\exp\left\{i\sum_{j=1}^{n} u_j X(t_j)\right\}\right] = \exp\left\{i\sum_{j=1}^{n} u_j m(t_j) - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} u_j u_k \Gamma(t_j, t_k)\right\}, \tag{11.5}$$

where $u_1, \ldots, u_n \in \mathbb{R}$ and where $m$ and $\Gamma$ are the mean and covariance functions
respectively.


Theorem 11.2.2 For a Gaussian process with index set $\mathbf{T} = \mathbb{R}$ or $\mathbb{Z}$ to be stationary,
it is necessary and sufficient that $m(t) = m$ and $\Gamma(t,s) = C(t-s)$ for all
$s,t \in \mathbf{T}$.


Proof. The necessity is obvious, whereas the sufficiency is proven by replacing
the $t_j$’s in (11.5) by $t_j + h$ to obtain the characteristic function of

$$(X(t_1 + h), \ldots, X(t_n + h)),$$

namely,

$$\exp\left\{i\sum_{j=1}^{n} u_j m - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} u_j u_k C(t_j - t_k)\right\},$$

and then observing that this quantity is independent of $h$.


EXAMPLE 11.2.3: CLIPPED GAUSSIAN PROCESS, I. Let $\{X(t)\}_{t \in \mathbb{R}}$ be a centered
stationary Gaussian process with covariance function $C_X(\tau)$. Define the clipped
(or hard-limited) process

$$Y(t) = \operatorname{sign} X(t),$$

with the convention $\operatorname{sign} X(t) = 0$ if $X(t) = 0$ (note however that this occurs
with null probability if $C_X(0) = \sigma_X^2 > 0$, which is henceforth assumed). Clearly
this stochastic process is centered. Moreover, it is unchanged when $\{X(t)\}_{t \in \mathbb{R}}$ is
multiplied by a positive constant. In particular, we may assume that the variance
$C_X(0)$ equals 1, so that the covariance matrix of the vector $(X(0), X(\tau))^T$ is

$$\Gamma(\tau) = \begin{pmatrix} 1 & \rho_X(\tau) \\ \rho_X(\tau) & 1 \end{pmatrix},$$

where $\rho_X(\tau)$ is the correlation coefficient of $X(0)$ and $X(\tau)$. We assume that $\Gamma(\tau)$
is invertible, that is, $|\rho_X(\tau)| < 1$.

We then have the Van Vleck–Middleton formula:

$$C_Y(\tau) = \frac{2}{\pi}\arcsin(\rho_X(\tau)).$$


Proof. Since for each $t$ the random variable $Y(t)$ takes the values $\pm 1$ and 0,
the latter with null probability, we can express the autocovariance function of the
clipped process as

$$C_Y(\tau) = 2\{P(X(0) > 0, X(\tau) > 0) - P(X(0) > 0, X(\tau) < 0)\},$$

where it was noted that

$$P(X(0) < 0, X(\tau) < 0) = P(X(0) > 0, X(\tau) > 0)$$

and that

$$P(X(0) < 0, X(\tau) > 0) = P(X(0) > 0, X(\tau) < 0).$$

The result then follows from that of Exercise 3.6.33 with $\rho = \rho_X(\tau)$.
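
The arcsine formula can be checked by simulation. The sketch below, a Monte Carlo illustration and not part of the proof, draws pairs of standard Gaussian variables with a prescribed correlation $\rho$, clips them to their signs, and compares the empirical mean of the product with $\frac{2}{\pi}\arcsin\rho$. The sample size, the seed and the function names are arbitrary choices made for this illustration.

```python
import math
import random


def clipped_covariance_estimate(rho, n_samples, rng):
    """Monte Carlo estimate of E[sign X(0) * sign X(tau)] for centered,
    unit-variance jointly Gaussian variables with correlation rho."""
    total = 0
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(n_samples):
        x0 = rng.gauss(0.0, 1.0)
        # x_tau has unit variance and correlation rho with x0.
        x_tau = rho * x0 + scale * rng.gauss(0.0, 1.0)
        total += 1 if (x0 > 0.0) == (x_tau > 0.0) else -1
    return total / n_samples


def van_vleck(rho):
    """Right-hand side of the formula: (2 / pi) * arcsin(rho)."""
    return (2.0 / math.pi) * math.asin(rho)


rng = random.Random(12345)
estimate = clipped_covariance_estimate(rho=0.5, n_samples=200_000, rng=rng)
```

For $\rho = 1/2$ the formula gives $\frac{2}{\pi}\cdot\frac{\pi}{6} = \frac13$, and the estimate should agree with this value up to the Monte Carlo error.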


The Wiener Process 


Definition 11.2.4 By definition, a standard Brownian motion, or standard Wiener
process, is a continuous centered Gaussian process $\{W(t)\}_{t \in \mathbb{R}_+}$ with independent
increments, such that $W(0) = 0$, and such that for any interval $[a,b] \subset \mathbb{R}_+$, the
variance of $W(b) - W(a)$ is equal to $b - a$.


In particular, the vector $(W(t_1), \ldots, W(t_k))$ with $0 < t_1 < \ldots < t_k$ admits the
probability density function

$$\frac{1}{(\sqrt{2\pi})^k \sqrt{t_1(t_2 - t_1)\cdots(t_k - t_{k-1})}}
\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{t_1} + \frac{(x_2 - x_1)^2}{t_2 - t_1} + \cdots + \frac{(x_k - x_{k-1})^2}{t_k - t_{k-1}}\right)\right\}.$$

Note for future reference that for $s,t \in \mathbb{R}_+$,

$$E[W(t)W(s)] = t \wedge s. \tag{11.6}$$

In fact, for $0 \le s \le t$,

$$\begin{aligned}
E[W(t)W(s)] &= E[(W(t) - W(s))W(s)] + E[W(s)^2] \\
&= E[(W(t) - W(s))(W(s) - W(0))] + E[(W(s) - W(0))^2] \\
&= 0 + s = t \wedge s.
\end{aligned}$$
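
Formula (11.6) lends itself to a quick simulation check. The sketch below samples the pair $(W(s), W(t))$ from two independent Gaussian increments, as in the computation above, and estimates $E[W(t)W(s)]$ by a sample mean. The times, sample size, seed and function names are arbitrary choices made for this illustration.

```python
import math
import random


def sample_wiener_pair(s, t, rng):
    """Sample (W(s), W(t)) for 0 <= s <= t from independent Gaussian increments."""
    w_s = math.sqrt(s) * rng.gauss(0.0, 1.0)            # W(s) - W(0), variance s
    w_t = w_s + math.sqrt(t - s) * rng.gauss(0.0, 1.0)  # add W(t) - W(s), variance t - s
    return w_s, w_t


def estimate_product_mean(s, t, n_samples, rng):
    """Sample mean of W(t) * W(s) over n_samples independent draws."""
    total = 0.0
    for _ in range(n_samples):
        w_s, w_t = sample_wiener_pair(s, t, rng)
        total += w_s * w_t
    return total / n_samples


rng = random.Random(2024)
estimate = estimate_product_mean(s=1.0, t=2.0, n_samples=200_000, rng=rng)
```

With $s = 1$ and $t = 2$ the estimate should be close to $t \wedge s = 1$.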


We now give the description of the Wiener process as limit of a properly rescaled
(both in time and amplitude) symmetric random walk. Let $\{X_n\}_{n \ge 0}$ be a symmetric
random walk on $\mathbb{Z}$ starting from 0, of the form

$$X_n := \sum_{k=1}^{n} Z_k,$$

where $\{Z_n\}_{n \ge 1}$ is an IID sequence of $\{-1,+1\}$-valued random variables with
$P(Z_n = -1) = P(Z_n = 1) = \frac{1}{2}$. Construct a continuous-time stochastic process
$\{X(t)\}_{t \ge 0}$ from this sequence as follows:

$$X(t) := \delta X_{\lfloor t/\Delta \rfloor} = \delta \sum_{k=1}^{\lfloor t/\Delta \rfloor} Z_k.$$

(Recall the notation $\lfloor a \rfloor = \sup\{k \in \mathbb{N}\,;\ k \le a\}$.) Since the $Z_k$’s are centered and of
variance 1, we have that

$$E[X(t)] = 0, \qquad \operatorname{Var}(X(t)) = \delta^2 \times \lfloor t/\Delta \rfloor.$$

Let $\Delta$ and $\delta$ tend to 0 in such a way that the limit is not trivial. With respect
to this goal, the choice $\Delta = \delta$ is not satisfactory since $E[X(t)] = 0$ and
$\lim_{\Delta \downarrow 0} \operatorname{Var}(X(t)) = 0$, leading to a null process. If we take $\delta^2 = \Delta$, we have
$E[X(t)] = 0$ and $\lim_{\Delta \downarrow 0} \operatorname{Var}(X(t)) = t$. We show that in this case, for all
$t_1, \ldots, t_m$ in $\mathbb{R}_+$ forming an increasing sequence, the limit distribution of the vector
$(X(t_1), \ldots, X(t_m))$ is that corresponding to a Wiener process.
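
The two scalings discussed above can be compared numerically. The sketch below evaluates $\operatorname{Var}(X(t)) = \delta^2 \lfloor t/\Delta \rfloor$ for the choice $\delta = \Delta$ and for the choice $\delta^2 = \Delta$, and also draws one sample of $X(t)$ under the second scaling. The step $\Delta = 2^{-12}$ and the function names are illustrative assumptions.

```python
import math
import random


def scaled_walk_variance(t, time_step, space_step):
    """Var X(t) = space_step**2 * floor(t / time_step)."""
    return space_step ** 2 * math.floor(t / time_step)


def simulate_scaled_walk(t, time_step, rng):
    """One sample of X(t) under the scaling space_step**2 = time_step."""
    space_step = math.sqrt(time_step)
    steps = math.floor(t / time_step)
    return space_step * sum(rng.choice((-1, 1)) for _ in range(steps))


time_step = 2.0 ** -12
# Choice delta = Delta: the variance is of order Delta and vanishes in the limit.
trivial = scaled_walk_variance(1.0, time_step, space_step=time_step)
# Choice delta**2 = Delta: the variance stays equal to floor(t / Delta) * Delta, close to t.
diffusive = scaled_walk_variance(1.0, time_step, space_step=math.sqrt(time_step))
sample = simulate_scaled_walk(1.0, time_step, random.Random(1))
```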


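We consider the case $m = 1$, the general case being an easy adaptation. Let
$t_1 = t$. In this case, since by the central limit theorem

$$\frac{X_{\lfloor t/\Delta \rfloor}}{\sqrt{\lfloor t/\Delta \rfloor}} \xrightarrow{\ \mathcal{D}\ } N(0,1),$$

we have, by Slutsky’s lemma (Theorem 7.1.6),

$$\frac{X(t)}{\sqrt{t}} = \frac{X_{\lfloor t/\Delta \rfloor}}{\sqrt{\lfloor t/\Delta \rfloor}} \times \sqrt{\frac{\Delta \lfloor t/\Delta \rfloor}{t}} \xrightarrow{\ \mathcal{D}\ } N(0,1).$$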


Therefore, at the limit (in distribution), $X(t)$ is a centered Gaussian variable with
variance $t$.


Pathology


Definition 11.2.4 of the Wiener process does not say much about the qualitative
behavior of this process. Although the trajectories of the Brownian motion are,
almost surely, continuous functions, their behavior is rather chaotic. First of all
we observe that, for fixed $t_0 > 0$ and $h > 0$, the random variable

$$\frac{W(t_0 + h) - W(t_0)}{h} \sim N(0, h^{-1})$$

and therefore it cannot converge in distribution as $h \downarrow 0$, since the limit of its
characteristic function vanishes for all $u \neq 0$ and is therefore not a characteristic function.
In particular, it does not converge almost surely to any random variable. Therefore,
for any $t_0 > 0$,

$$P(t \mapsto W(t) \text{ is not differentiable at } t_0) = 1.$$


But the situation is even more dramatic: 


Theorem 11.2.5 Almost all the paths of the Wiener process are nowhere differ- 
entiable. 


We shall not prove this result here,¹ but state one of its consequences.


Corollary 11.2.6 Almost all the paths of the Wiener process are of unbounded 
variation on finite intervals. 


Proof. This is because any function of bounded variation is differentiable almost
everywhere (with respect to Lebesgue measure).²


The Brownian Bridge 


This is the process $\{X(t)\}_{t \in [0,1]}$ obtained from the standard Brownian motion
$\{W(t)\}_{t \in [0,1]}$ by

$$X(t) := W(t) - tW(1) \qquad (t \in [0,1]).$$

It is a Gaussian process since for all $t_1, \ldots, t_k \in [0,1]$, $(X(t_1), \ldots, X(t_k))$ is a Gaussian
vector, being a linear function of the Gaussian vector $(W(t_1), \ldots, W(t_k), W(1))$.
In particular, since it is a centered Gaussian process, its distribution is entirely
characterized by its covariance function and a simple calculation (Exercise 11.6.5)
gives

$$\operatorname{cov}(X(t), X(s)) = s(1-t) \qquad (0 \le s \le t \le 1).$$

In particular, $X(0) = X(1) = 0$.


¹ See for instance [7], Theorem 11.2.8.
² See for instance Corollary 6, section 5.2 of [15].


The Brownian bridge $\{X(t)\}_{t \in [0,1]}$ is distributionwise a Wiener process
$\{W(t)\}_{t \in [0,1]}$ conditioned by $W(1) = 0$. This statement is problematic in that
the conditioning event has a null probability. However, it is true “at the limit”:


Theorem 11.2.7 Let $f : \mathbb{R}^k \to \mathbb{R}$ be a bounded and continuous function. Then,
for any $0 \le t_1 < t_2 < \cdots < t_k \le 1$,

$$\lim_{\varepsilon \downarrow 0} E[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon] = E[f(X(t_1), \ldots, X(t_k))].$$


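Proof.

$$\begin{aligned}
E[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon]
&= E[f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1)) \mid |W(1)| \le \varepsilon] \\
&= \frac{E\left[f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1))\, 1_{\{|W(1)| \le \varepsilon\}}\right]}{P(|W(1)| \le \varepsilon)}.
\end{aligned}$$

In view of the independence of $\{X(t)\}_{t \in [0,1]}$ and $W(1)$ (Exercise 11.6.7), this last
quantity equals

$$\frac{\int_{-\varepsilon}^{+\varepsilon} e^{-\frac{1}{2}x^2}\, E[f(X(t_1) + t_1 x, \ldots, X(t_k) + t_k x)]\,dx}{\int_{-\varepsilon}^{+\varepsilon} e^{-\frac{1}{2}x^2}\,dx},$$

which tends to $E[f(X(t_1), \ldots, X(t_k))]$ as $\varepsilon \downarrow 0$.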
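
Two second-order facts about the Brownian bridge $X(t) = W(t) - tW(1)$ can be verified by a direct bilinear expansion based on $E[W(t)W(s)] = t \wedge s$: the covariance $s(1-t)$ for $s \le t$, and the fact that $X(t)$ is uncorrelated with $W(1)$ (which, the variables being jointly Gaussian, is the independence used in the proof above). The sketch below carries out this expansion numerically; the function names are illustrative.

```python
def wiener_cov(s, t):
    """E[W(s) W(t)] = min(s, t) for the standard Wiener process."""
    return min(s, t)


def bridge_cov(s, t):
    """cov(X(s), X(t)) for X(u) = W(u) - u * W(1), expanded by bilinearity."""
    return (wiener_cov(s, t) - t * wiener_cov(s, 1.0)
            - s * wiener_cov(1.0, t) + s * t * wiener_cov(1.0, 1.0))


def bridge_endpoint_cov(t):
    """cov(X(t), W(1)) = E[W(t) W(1)] - t * E[W(1)**2]."""
    return wiener_cov(t, 1.0) - t * wiener_cov(1.0, 1.0)


pairs = [(0.0, 0.5), (0.25, 0.75), (0.4, 0.4), (0.3, 1.0)]
covariances = [bridge_cov(s, t) for s, t in pairs]
```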


Gauss—Markov Processes 


We now investigate another type of Gaussian processes, those having the additional 
property of being Markovian. We first give the general definition of a Markov 
process: 


Definition 11.2.8 Let $\mathbf{T}$ be $\mathbb{R}_+$ or $\mathbb{N}$. A real-valued stochastic process $\{X(t)\}_{t \in \mathbf{T}}$ is
called a Markov process if for all $f : \mathbb{R} \to \mathbb{R}$ that is non-negative or such that
$E[|f(X(t))|] < \infty$ $(t \in \mathbf{T})$,

$$E[f(X(t)) \mid X(s), X(t_1), \ldots, X(t_n)] = E[f(X(t)) \mid X(s)] \tag{11.7}$$

for all $0 \le t_1 < t_2 < \cdots < t_n < s < t$.


Of course this definition fits the special case of Markov chains.


EXAMPLE 11.2.9: WIENER IS GAUSS–MARKOV. The Wiener process is a
Gauss–Markov process (Exercise 11.6.9).


EXAMPLE 11.2.10: A DISCRETE-TIME GAUSS–MARKOV PROCESS. A discrete-time
stochastic process $\{X_n\}_{n \ge 0}$ defined by $X_{n+1} = aX_n + \varepsilon_{n+1}$ $(n \ge 0)$, where
$\{\varepsilon_n\}_{n \ge 1}$ is an IID centered Gaussian sequence and $X_0$ is a Gaussian random variable
independent of this sequence, is a Gauss–Markov process (Exercise 11.6.9).


The stochastic processes that are Gaussian and Markovian are in fact Wiener 
processes with a different time scale. The proof starts with a simple lemma. 


Lemma 11.2.11 Let $\{X(t)\}_{t \ge 0}$ be a centered Gaussian process with covariance
function $\Gamma$ such that $\Gamma(t,t) > 0$ for all $t \ge 0$. If in addition it is Markov, then for
all $t > s > t_0 \ge 0$,

$$\Gamma(t, t_0) = \frac{\Gamma(t,s)\,\Gamma(s,t_0)}{\Gamma(s,s)}. \tag{11.8}$$


Proof. By the Gaussian property, the linear regression of $X(t)$ on $X(t_0)$ is equal
to the conditional expectation of $X(t)$ given $X(t_0)$:

$$E[X(t) \mid X(t_0)] = \frac{\Gamma(t,t_0)}{\Gamma(t_0,t_0)}\, X(t_0). \tag{$\star$}$$

Using this remark and the Markov property,

$$\begin{aligned}
E[X(t) \mid X(t_0)] &= E[E[X(t) \mid X(t_0), X(s)] \mid X(t_0)] \\
&= E[E[X(t) \mid X(s)] \mid X(t_0)] = E\left[\frac{\Gamma(t,s)}{\Gamma(s,s)}\, X(s) \,\Big|\, X(t_0)\right] \\
&= \frac{\Gamma(t,s)}{\Gamma(s,s)}\, E[X(s) \mid X(t_0)] = \frac{\Gamma(t,s)}{\Gamma(s,s)}\, \frac{\Gamma(s,t_0)}{\Gamma(t_0,t_0)}\, X(t_0).
\end{aligned}$$

Comparing with the right-hand side of ($\star$), and since $P(X(t_0) \neq 0) > 0$ (in fact
$= 1$), we obtain (11.8).
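
Identity (11.8) can be observed on the discrete-time process of Example 11.2.10, for which the same argument applies with integer times. The sketch below computes the covariance of $X_{n+1} = aX_n + \varepsilon_{n+1}$ from the variance recursion $\operatorname{Var}(X_{n+1}) = a^2 \operatorname{Var}(X_n) + \operatorname{Var}(\varepsilon)$ and from $\operatorname{cov}(X_{n+k}, X_n) = a^k \operatorname{Var}(X_n)$, assuming $X_0$ centered. The parameter values and the function name are illustrative assumptions.

```python
def ar1_covariance(a, var_x0, var_noise, horizon):
    """Return gamma with gamma[n][m] = cov(X_n, X_m), 0 <= n, m <= horizon,
    for X_{n+1} = a * X_n + eps_{n+1} with X_0 centered and independent noise."""
    variances = [var_x0]
    for _ in range(horizon):
        variances.append(a * a * variances[-1] + var_noise)
    gamma = [[0.0] * (horizon + 1) for _ in range(horizon + 1)]
    for n in range(horizon + 1):
        for m in range(horizon + 1):
            lo, hi = min(n, m), max(n, m)
            # X_hi = a**(hi - lo) * X_lo + (noise independent of X_lo)
            gamma[n][m] = a ** (hi - lo) * variances[lo]
    return gamma


gamma = ar1_covariance(a=0.8, var_x0=1.5, var_noise=0.4, horizon=6)
# Largest deviation from (11.8) over all triples t0 < s < t.
max_gap = max(
    abs(gamma[t][t0] - gamma[t][s] * gamma[s][t0] / gamma[s][s])
    for t0 in range(7) for s in range(t0 + 1, 7) for t in range(s + 1, 7)
)
```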


Theorem 11.2.12 Let $\{X(t)\}_{t \ge 0}$ be a centered Gaussian process with continuous
covariance function $\Gamma$ such that $\Gamma(t,t) > 0$ for all $t \ge 0$. It is Markov if and only
if there exist functions $f, g : \mathbb{R}_+ \to \mathbb{R}_+$ such that for all $s,t \ge 0$

$$\Gamma(t,s) = f(t \vee s)\, g(t \wedge s). \tag{11.9}$$


Proof.
Necessity. Suppose the process is Gauss–Markov. Let

$$\rho(t,s) := \frac{\Gamma(t,s)}{\Gamma(t,t)^{1/2}\, \Gamma(s,s)^{1/2}}$$

denote its autocorrelation function. By Eqn. (11.8), for all $t > s > t_0 \ge 0$,

$$\rho(t,t_0) = \rho(t,s)\,\rho(s,t_0). \tag{$\star\star$}$$

We show that $\rho(t,s) > 0$ for all $t,s \ge 0$. Indeed, assuming $s > t$ and using ($\star\star$)
repeatedly, for all $n \ge 1$,

$$\rho(s,t) = \prod_{i=1}^{n} \rho\left(t + \frac{i(s-t)}{n},\ t + \frac{(i-1)(s-t)}{n}\right)$$

and therefore, using the facts that $\rho(u,u) = 1$ for all $u$ and that $\rho$ is uniformly
continuous on bounded rectangles, we can choose $n$ large enough as to make all
the elements in the above product positive. Therefore, we may divide by $\rho(s,t_0)$
and write ($\star\star$) as

$$\rho(t,s) = \frac{\rho(t,t_0)}{\rho(s,t_0)},$$

that is,

$$\Gamma(t,s) = \rho(t,t_0)\,\Gamma(t,t)^{1/2} \times \frac{\Gamma(s,s)^{1/2}}{\rho(s,t_0)},$$

from which we obtain the desired conclusion (here $s = t \wedge s$ and $t = t \vee s$).


Sufficiency. Suppose that the process is Gaussian and that (11.9) holds true.
Assume $t \ge s$. Therefore $\Gamma(t,s) = f(t)g(s)$. By Schwarz’s inequality, $\Gamma(t,s) \le
\Gamma(t,t)^{1/2}\, \Gamma(s,s)^{1/2}$ or, equivalently, $f(t)g(s) \le (f(t)g(t)f(s)g(s))^{1/2}$. Therefore, the
function

$$\tau(t) := \frac{g(t)}{f(t)}$$

is monotone non-decreasing. In particular, the centered Gaussian process

$$Y(t) := f(t)\, W(\tau(t))$$

is Markov (because the Wiener process is Markov). Its covariance function is

$$\begin{aligned}
E[Y(t)Y(s)] &= f(t)f(s)\, E[W(\tau(t))W(\tau(s))] \\
&= f(t)f(s)\,(\tau(t) \wedge \tau(s)) \\
&= f(t)f(s)\,\tau(s) = f(t)g(s).
\end{aligned}$$

Since it has the same covariance as $\{X(t)\}_{t \ge 0}$ and since both processes are centered
and Gaussian, they have the same distribution. In particular $\{X(t)\}_{t \ge 0}$ is Markov.


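Theorem 11.2.13 A WSS Gaussian stochastic process $\{X(t)\}_{t\ge 0}$ is Markov if and only if its covariance function has the form

$$C(\tau) = C(0)\,e^{-\lambda|\tau|}$$

for some $\lambda \ge 0$.

Proof. If $\{X(t)\}_{t\ge 0}$ is WSS, $\Gamma(t,s) = C(t-s)$ and therefore, with $\rho(t) := C(t)/C(0)$, relation ($\star\star$) reads

$$\rho(t+s) = \rho(t)\,\rho(s),$$

which implies that $\rho(t) = c\,e^{\alpha t}$ for some $\alpha \in \mathbb{R}$. Here $c = 1$ since $\rho(0) = 1$. Now $\rho(1) = C(1)/C(0) = e^{\alpha}$. But (Schwarz's inequality) $C(1) \le C(0)$, that is $\rho(1) \le 1$, so that $\alpha \le 0$.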


11.3. The Wiener–Doob Integral


The Doob stochastic integral, a special case of which is the Wiener stochastic integral

$$\int_{\mathbb{R}} f(t)\,dW(t) \qquad (\star)$$

that is defined for a certain class of measurable functions $f$, is not of the usual types. For instance, it cannot be defined pathwise as a Stieltjes–Lebesgue integral since the trajectories of the Brownian motion are of unbounded variation (Corollary 11.2.6). Nor can this integral be interpreted as $\int_{\mathbb{R}} f(t)\,\dot W(t)\,dt$ (where the dot denotes derivation), since the Brownian motion sample paths are not differentiable (Theorem 11.2.5).


The integral in ($\star$) will therefore be defined in a radically different way. In fact, the Doob stochastic integral will be defined more generally, with respect to a process with centered and uncorrelated increments.


Definition 11.3.1 Let $\{Z(t)\}_{t\in\mathbb{R}}$ be a complex stochastic process such that for all intervals $[t_1,t_2] \subset \mathbb{R}$ the increments $Z(t_2) - Z(t_1)$ are in $L^2_{\mathbb{C}}(P)$, centered and such that for some locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$:

$$E\big[(Z(t_2) - Z(t_1))(Z(t_4) - Z(t_3))^*\big] = \mu\big((t_1,t_2] \cap (t_3,t_4]\big)$$

for all $[t_1,t_2] \subset \mathbb{R}$ and all $[t_3,t_4] \subset \mathbb{R}$. Such a stochastic process $\{Z(t)\}_{t\in\mathbb{R}}$ is called a stochastic process with centered and uncorrelated increments, and $\mu$ is its structural measure.


In particular, if the intervals $(t_1,t_2]$ and $(t_3,t_4]$ are disjoint, $Z(t_2) - Z(t_1)$ and $Z(t_4) - Z(t_3)$ are orthogonal elements of the Hilbert space $L^2_{\mathbb{C}}(P)$.


EXAMPLE 11.3.2: WIENER PROCESS. The Wiener process $\{W(t)\}_{t\in\mathbb{R}}$ is a process with centered and uncorrelated increments whose structural measure is the Lebesgue measure.


Gaussian Subspaces 


Before proceeding to the construction of the Wiener—Doob integral, the definition 
of a Gaussian subspace is necessary. 


Definition 11.3.3 Let $\{X_i\}_{i\in I}$ be an arbitrary collection of complex (resp., real) random variables in $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$). The Hilbert subspace of $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$) consisting of the closure of the vector space of finite linear complex (resp., real) combinations of elements of $\{X_i\}_{i\in I}$ is called the complex (resp., real) Hilbert subspace generated by $\{X_i\}_{i\in I}$, and is denoted by $H_{\mathbb{C}}(X_i, i\in I)$ (resp., $H_{\mathbb{R}}(X_i, i\in I)$).


More explicitly, in the complex case for instance: the Hilbert subspace $H_{\mathbb{C}}(X_i, i\in I) \subseteq L^2_{\mathbb{C}}(P)$ consists of all complex square-integrable random variables that are limits in the quadratic mean (that is, limits in $L^2_{\mathbb{C}}(P)$) of some sequence of finite complex linear combinations of elements in the set $\{X_i\}_{i\in I}$.


Definition 11.3.4 A collection $\{X_i\}_{i\in I}$ of real random variables defined on the same probability space, where $I$ is an arbitrary index set, is called a Gaussian family if for every finite set of indices $i_1,\dots,i_k \in I$, the random vector $(X_{i_1},\dots,X_{i_k})$ is Gaussian. A Hilbert subspace $G$ of the real Hilbert space $L^2_{\mathbb{R}}(P)$ is called a Gaussian (Hilbert) subspace if it is a Gaussian family.


Theorem 11.3.5 Let $\{X_i\}_{i\in I}$ be a Gaussian family of random variables of $L^2_{\mathbb{R}}(P)$. Then the Hilbert subspace $H_{\mathbb{R}}(X_i, i\in I)$ generated by $\{X_i\}_{i\in I}$ is a Gaussian subspace of $L^2_{\mathbb{R}}(P)$.


Proof. By definition, the Hilbert subspace $H_{\mathbb{R}}(X_i, i\in I)$ consists of all the random variables in $L^2_{\mathbb{R}}(P)$ that are limits in the quadratic mean of finite linear combinations of elements of the family $\{X_i\}_{i\in I}$. The result follows from that in Example 7.4.5.


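Construction of the Wiener–Doob Integral


This integral is defined for all integrands $f \in L^2_{\mathbb{C}}(\mu)$ in the following manner. First, we define it for all $f \in \mathcal{L}$, the vector subspace of $L^2_{\mathbb{C}}(\mu)$ formed by the finite complex linear combinations of interval indicator functions

$$f(t) = \sum_{i=1}^{N} \alpha_i\, 1_{(a_i,b_i]}(t).$$

For such functions, by definition,

$$\int_{\mathbb{R}} f(t)\,dZ(t) := \sum_{i=1}^{N} \alpha_i\,\big(Z(b_i) - Z(a_i)\big).$$

Observe that this random variable belongs to the Hilbert subspace $H_{\mathbb{C}}(Z)$ of $L^2_{\mathbb{C}}(P)$ generated by $\{Z(t)\}_{t\in\mathbb{R}}$. One easily verifies that the linear mapping

$$\varphi : f \in \mathcal{L} \;\mapsto\; \int_{\mathbb{R}} f(t)\,dZ(t) \in L^2_{\mathbb{C}}(P)$$

is an isometry, that is,

$$\int_{\mathbb{R}} |f(t)|^2\,\mu(dt) = E\Big[\Big|\int_{\mathbb{R}} f(t)\,dZ(t)\Big|^2\Big].$$

Since $\mathcal{L}$ is a dense subset of $L^2_{\mathbb{C}}(\mu)$³, $\varphi$ can be uniquely extended to an isometric linear mapping of $L^2_{\mathbb{C}}(\mu)$ into $H_{\mathbb{C}}(Z)$ (Theorem A.0.6). We continue to call this extension $\varphi$ and then define, for all $f \in L^2_{\mathbb{C}}(\mu)$, the Doob integral of $f$ with respect to $\{Z(t)\}_{t\in\mathbb{R}}$ by

$$\int_{\mathbb{R}} f(t)\,dZ(t) := \varphi(f).$$

The fact that $\varphi$ is an isometry is expressed by the Doob isometry formula

$$E\Big[\Big(\int_{\mathbb{R}} f(t)\,dZ(t)\Big)\Big(\int_{\mathbb{R}} g(t)\,dZ(t)\Big)^*\Big] = \int_{\mathbb{R}} f(t)\,g(t)^*\,\mu(dt), \qquad (11.10)$$

where $f$ and $g$ are in $L^2_{\mathbb{C}}(\mu)$. Note also that for all $f \in L^2_{\mathbb{C}}(\mu)$:

$$E\Big[\int_{\mathbb{R}} f(t)\,dZ(t)\Big] = 0, \qquad (11.11)$$

since the Doob integral is the limit in $L^2_{\mathbb{C}}(P)$ of random variables of the type $\sum_{i=1}^{N} \alpha_i\,(Z(b_i) - Z(a_i))$ that have mean 0 (use the continuity of the inner product in $L^2_{\mathbb{C}}(P)$).

³ The proof is not obvious. See for instance Theorem 9.4 of Théorie de l'intégration, M. Briane and G. Pagès, Vuibert, Paris, 2004.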


A Formula of Integration by Parts 


Theorem 11.3.6 Let $\{W(t)\}_{t\in\mathbb{R}_+}$ be a standard Wiener process and denote by $H_{\mathbb{R}}(W)$ the Gaussian real Hilbert space that it generates. The Wiener integral $Y = \int_{\mathbb{R}_+} f(t)\,dW(t)$, where $f \in L^2_{\mathbb{R}}(\mathbb{R}_+)$, is characterized by the following two properties:


(a) $Y \in H_{\mathbb{R}}(W)$;
(b) $E[Y\,W(s)] = \int_0^s f(t)\,dt$ for all $s \ge 0$.


Proof. Necessity: We have already noted that, by construction, $\int_{\mathbb{R}_+} f(t)\,dW(t) \in H_{\mathbb{R}}(W)$. As for (b), this is just the isometry formula

$$E\Big[\int_{\mathbb{R}_+} f(t)\,dW(t)\,\int_{\mathbb{R}_+} 1_{\{t\le s\}}\,dW(t)\Big] = \int_0^s f(t)\,dt.$$

Sufficiency: Since $Y - \int_{\mathbb{R}_+} f(s)\,dW(s)$ is in $H_{\mathbb{R}}(W)$, it suffices to show that this random variable is orthogonal to the generators $W(s)$, $s \in \mathbb{R}_+$, of $H_{\mathbb{R}}(W)$: it is then the null element of $H_{\mathbb{R}}(W)$, and therefore $Y = \int_{\mathbb{R}_+} f(s)\,dW(s)$, $P$-a.s. But, by (b) and, again, by the isometry formula,

$$E\Big[\Big(Y - \int_{\mathbb{R}_+} f(u)\,dW(u)\Big)W(s)\Big] = \int_0^s f(t)\,dt - \int_0^s f(t)\,dt = 0.$$


Theorem 11.3.7 Let $\{W(t)\}_{t\in\mathbb{R}}$ be a standard Wiener process. Let $T$ be a positive real number and let $f : [0,T] \to \mathbb{R}$ be a continuously differentiable function. In particular $f \in L^2_{\mathbb{R}}([0,T])$ and therefore the integral $\int_0^T f(t)\,dW(t)$ is well defined. Then:

$$\int_0^T f(t)\,dW(t) + \int_0^T f'(t)\,W(t)\,dt = f(T)\,W(T).$$

Proof. By Theorem 11.3.6, it suffices to prove that for all $s \in [0,T]$,

$$E\Big[\Big(f(T)W(T) - \int_0^T f'(t)\,W(t)\,dt\Big)W(s)\Big] = \int_0^s f(t)\,dt,$$

which, using the equality $E[W(a)W(b)] = \min(a,b)$, reduces to

$$f(T)\,s - E\Big[\Big(\int_0^T f'(t)\,W(t)\,dt\Big)W(s)\Big] = \int_0^s f(t)\,dt.$$

By Fubini:

$$E\Big[\Big(\int_0^T f'(t)\,W(t)\,dt\Big)W(s)\Big] = \int_0^T f'(t)\,E[W(t)W(s)]\,dt = \int_0^T f'(t)\,\min(t,s)\,dt.$$

We therefore have to check that

$$f(T)\,s - \int_0^s f'(t)\,t\,dt - s\int_s^T f'(t)\,dt = \int_0^s f(t)\,dt,$$

or

$$f(T)\,s - \int_0^s f'(t)\,t\,dt - s\,\big(f(T) - f(s)\big) = \int_0^s f(t)\,dt.$$

But this is

$$-\int_0^s f'(t)\,t\,dt + s\,f(s) = \int_0^s f(t)\,dt,$$

which follows by integration by parts.

11.4 Two Applications 
Langevin’s Equation 


Definition 11.4.1 Let $\{W(t)\}_{t\in\mathbb{R}}$ be a standard Wiener process, and let for all $t \in \mathbb{R}$

$$X(t) = (2\alpha)^{1/2}\int_{-\infty}^{t} e^{-\alpha(t-s)}\,\sigma\,dW(s),$$

where $\alpha > 0$ and $\sigma > 0$. The process $\{X(t)\}_{t\in\mathbb{R}}$ defined in this way is called the Ornstein–Uhlenbeck process.

Since for all $t \in \mathbb{R}$, $X(t)$ belongs to $H_{\mathbb{R}}(W)$, it is a Gaussian process (Theorem 11.3.5). It is centered, with covariance function

$$\Gamma(t,s) = \sigma^2 e^{-\alpha|t-s|},$$

as follows directly from the isometry formula (11.10).


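Definition 11.4.2 The Langevin equation is, by definition, the equation

$$dV(t) + \alpha V(t)\,dt = \sigma\,dW(t),$$

to be interpreted as

$$V(t) - V(0) + \alpha\int_0^t V(s)\,ds = \sigma W(t).$$


Theorem 11.4.3 The unique solution of the Langevin equation with initial value $V(0)$ is

$$V(t) = e^{-\alpha t}V(0) + \int_0^t e^{-\alpha(t-s)}\,\sigma\,dW(s).$$

In particular, with the choice $V(0) = \int_{-\infty}^{0} \sigma e^{\alpha s}\,dW(s)$,

$$V(t) = \int_{-\infty}^{t} e^{-\alpha(t-s)}\,\sigma\,dW(s)$$

is the Ornstein–Uhlenbeck process (up to the normalizing factor $(2\alpha)^{1/2}$ of Definition 11.4.1).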


Proof. Using the integration by parts formula of Theorem 11.3.7, the expression of $V(t)$ given in the theorem is found to be equivalent to

$$V(t) = e^{-\alpha t}V(0) + \sigma W(t) - \int_0^t \alpha e^{-\alpha(t-s)}\,\sigma W(s)\,ds. \qquad (\star)$$

By the (classical) formula of integration by parts,

$$e^{-\alpha u}\int_0^u e^{\alpha s}\,\sigma W(s)\,ds = -\int_0^u \alpha e^{-\alpha t}\Big(\int_0^t e^{\alpha s}\,\sigma W(s)\,ds\Big)dt + \int_0^u \sigma W(t)\,dt.$$

Therefore, integrating both sides of ($\star$) from 0 to $u$,

$$\alpha\int_0^u V(t)\,dt = (1 - e^{-\alpha u})V(0) + \int_0^u \alpha e^{-\alpha(u-s)}\,\sigma W(s)\,ds,$$

and finally:

$$V(u) - V(0) + \alpha\int_0^u V(t)\,dt = V(u) - e^{-\alpha u}V(0) + \int_0^u \alpha e^{-\alpha(u-s)}\,\sigma W(s)\,ds = \sigma W(u).$$

We now prove uniqueness. Let $V'$ be another solution with the same initial value. With $U := V - V'$, we therefore have

$$U(t) = -\alpha\int_0^t U(s)\,ds,$$

so that $|U(t)| \le \alpha\int_0^t |U(s)|\,ds$. The unique solution is therefore the null function, by Gronwall's lemma:


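Lemma 11.4.4 Let $x : \mathbb{R}_+ \to \mathbb{R}$ be a positive locally integrable real function such that

$$x(t) \le a + b\int_0^t x(s)\,ds \qquad (\star)$$

for some $a \ge 0$ and $b > 0$. Then

$$x(t) \le a\,e^{bt}.$$

Proof. Multiplying ($\star$) by $e^{-bt}$, we have

$$e^{-bt}x(t) \le a\,e^{-bt} + b\,e^{-bt}\int_0^t x(s)\,ds$$

or, equivalently,

$$\frac{d}{dt}\Big(e^{-bt}\int_0^t x(s)\,ds\Big) \le a\,e^{-bt}.$$

Integrating this inequality:

$$\frac{a}{b}\big(1 - e^{-bt}\big) \ge e^{-bt}\int_0^t x(s)\,ds.$$

Substituting this into ($\star$):

$$x(t) \le a + b\,e^{bt}\,\frac{a}{b}\big(1 - e^{-bt}\big) = a\,e^{bt}.$$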
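Returning to Theorem 11.4.3, the existence part of the proof can be checked numerically on one simulated path. The sketch below is an illustration under assumptions, not part of the text: it evaluates $V$ through the pathwise form ($\star$) of the proof, replaces all time integrals by left Riemann sums, and uses arbitrary parameter values; the name `langevin_residual` is hypothetical. It returns the residual $V(T) - V(0) + \alpha\int_0^T V(t)\,dt - \sigma W(T)$, which should be close to zero.

```python
import math
import random

def langevin_residual(alpha=1.5, sigma=0.7, v0=0.3, T=1.0, n=20000, seed=1):
    """Residual of the integrated Langevin equation for the candidate solution."""
    rng = random.Random(seed)
    dt = T / n
    w = 0.0        # W(t_k)
    conv = 0.0     # left Riemann sum for int_0^{t_k} e^{alpha s} W(s) ds
    int_v = 0.0    # left Riemann sum for int_0^{t_k} V(s) ds

    def v_at(t, w_t, conv_t):
        # V(t) = e^{-alpha t} V(0) + sigma W(t) - alpha sigma e^{-alpha t} int_0^t e^{alpha s} W(s) ds
        return (math.exp(-alpha * t) * v0 + sigma * w_t
                - alpha * sigma * math.exp(-alpha * t) * conv_t)

    for k in range(n):
        t = k * dt
        int_v += v_at(t, w, conv) * dt
        conv += math.exp(alpha * t) * w * dt
        w += rng.gauss(0.0, math.sqrt(dt))

    v_T = v_at(T, w, conv)
    return v_T - v0 + alpha * int_v - sigma * w

residual = langevin_residual()
```

The residual is of the order of the mesh size $T/n$; refining the grid shrinks it further.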
The Cameron–Martin Formula


This result is of interest in communications and detection theory. One will recog- 
nize the likelihood ratio associated with the hypothesis “signal plus white Gaussian 
noise” against the hypothesis “white Gaussian noise only”. 


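Theorem 11.4.5 Let $\{X(t)\}_{t\ge 0}$ be, with respect to probability $P$, a Wiener process with variance $\sigma^2$ and let $\gamma : \mathbb{R} \to \mathbb{R}$ be in $L^2_{\mathbb{R}}(\ell)$. For any $T \in \mathbb{R}_+$, the formula

$$\frac{dQ}{dP} = \exp\Big\{\frac{1}{\sigma^2}\int_0^T \gamma(t)\,dX(t) - \frac{1}{2\sigma^2}\int_0^T \gamma^2(t)\,dt\Big\} \qquad (11.12)$$

defines a probability measure $Q$ on $(\Omega, \mathcal{F})$ with respect to which

$$X(t) - \int_0^t \gamma(s)\,ds$$

is, on the interval $[0,T]$, a Wiener process with variance $\sigma^2$.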


The proof of Theorem 11.4.5 is based on the following preliminary result. 


Lemma 11.4.6 Let $\{X(t)\}_{t\ge 0}$ be a Wiener process with variance $\sigma^2$ and let $\varphi : \mathbb{R} \to \mathbb{R}$ be in $L^2_{\mathbb{R}}(\ell)$. Then, for any $T \in \mathbb{R}_+$,

$$E\Big[e^{\int_0^T \varphi(t)\,dX(t)}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \varphi^2(t)\,dt}. \qquad (11.13)$$

Proof. First consider the case

$$\varphi(t) = \sum_{k=1}^{N} \alpha_k\,1_{(a_k,b_k]}(t), \qquad (11.14)$$

where $\alpha_k \in \mathbb{R}$ and the intervals $(a_k,b_k]$ are disjoint. For this special case, formula (11.13) reduces to

$$E\Big[e^{\sum_{k=1}^{N}\alpha_k(X(b_k) - X(a_k))}\Big] = e^{\frac{1}{2}\sigma^2\sum_{k=1}^{N}\alpha_k^2(b_k - a_k)}$$

and therefore follows directly from the independence of the increments of a Wiener process and from the Gaussian property of these increments, in particular, the formula giving the Laplace transform of the centered Gaussian variable $X(b) - X(a)$ with variance $\sigma^2(b-a)$:

$$E\Big[e^{\alpha(X(b) - X(a))}\Big] = e^{\frac{1}{2}\sigma^2\alpha^2(b-a)}.$$

Let now $\{\varphi_n\}_{n\ge 1}$ be a sequence of functions of type (11.14) converging in $L^2_{\mathbb{R}}(\ell)$ to $\varphi$ (in particular, $\lim_{n\uparrow\infty}\int_0^T \varphi_n^2(t)\,dt = \int_0^T \varphi^2(t)\,dt$). Therefore,

$$\lim_{n\uparrow\infty}\int_0^T \varphi_n(t)\,dX(t) = \int_0^T \varphi(t)\,dX(t),$$

where the latter convergence is in $L^2_{\mathbb{R}}(P)$. This convergence can be assumed to take place almost surely by taking if necessary a subsequence. From the equality

$$E\Big[e^{\int_0^T \varphi_n(t)\,dX(t)}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \varphi_n^2(t)\,dt}$$

we can then deduce (11.13), at least if the sequence of random variables in the left-hand side is uniformly integrable. This is the case because the quantity

$$E\Big[\Big(e^{\int_0^T \varphi_n(t)\,dX(t)}\Big)^2\Big] = E\Big[e^{2\int_0^T \varphi_n(t)\,dX(t)}\Big] = e^{2\sigma^2\int_0^T \varphi_n^2(t)\,dt}$$

is uniformly bounded, and therefore the uniform integrability claim follows from Theorem 6.5.5, with $G(t) = t^2$.


We may now turn to the proof of Theorem 11.4.5. 


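Proof. The fact that (11.12) properly defines a probability $Q$, that is, that the expectation of the right-hand side of (11.12) equals 1, follows from Lemma 11.4.6 with $\varphi(t) = \frac{1}{\sigma^2}\gamma(t)$.

Letting

$$Y(t) = X(t) - \int_0^t \gamma(s)\,ds,$$

we have to prove that, under $Q$, this stochastic process is centered and Gaussian with independent increments of variance $\sigma^2$ per unit time. To do this, we must show that

$$E_Q\Big[e^{\sum_{k=1}^{N}\alpha_k(Y(b_k) - Y(a_k))}\Big] = e^{\frac{1}{2}\sigma^2\sum_{k=1}^{N}\alpha_k^2(b_k - a_k)},$$

where $\alpha_k \in \mathbb{R}$ and the intervals $(a_k,b_k] \subset [0,T]$ are disjoint, that is, letting $\psi(t) = \sum_{k=1}^{N}\alpha_k 1_{(a_k,b_k]}(t)$,

$$E_Q\Big[e^{\int_0^T \psi(t)\,dY(t)}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \psi^2(t)\,dt},$$

or equivalently,

$$E_P\Big[\frac{dQ}{dP}\,e^{\int_0^T \psi(t)\,dY(t)}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \psi^2(t)\,dt},$$

that is,

$$E_P\Big[e^{\frac{1}{\sigma^2}\int_0^T \gamma(t)\,dX(t) - \frac{1}{2\sigma^2}\int_0^T \gamma^2(t)\,dt}\;e^{\int_0^T \psi(t)\,dX(t) - \int_0^T \psi(t)\gamma(t)\,dt}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \psi^2(t)\,dt}.$$

Simplifying:

$$E_P\Big[e^{\int_0^T (\psi(t) + \frac{1}{\sigma^2}\gamma(t))\,dX(t) - \int_0^T \psi(t)\gamma(t)\,dt - \frac{1}{2\sigma^2}\int_0^T \gamma^2(t)\,dt}\Big] = e^{\frac{1}{2}\sigma^2\int_0^T \psi^2(t)\,dt},$$

and using (11.13) with $\varphi(t) = \psi(t) + \frac{1}{\sigma^2}\gamma(t)$, the left-hand side is equal to

$$e^{\frac{1}{2}\sigma^2\int_0^T (\psi(t) + \frac{1}{\sigma^2}\gamma(t))^2\,dt - \int_0^T \psi(t)\gamma(t)\,dt - \frac{1}{2\sigma^2}\int_0^T \gamma^2(t)\,dt}.$$

The proof is completed since

$$\frac{1}{2}\sigma^2\Big(\psi + \frac{1}{\sigma^2}\gamma\Big)^2 - \psi\gamma - \frac{1}{2\sigma^2}\gamma^2 = \frac{1}{2}\sigma^2\psi^2.$$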
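A Monte Carlo sanity check of Theorem 11.4.5 is easy to set up in the simplest case of a constant drift $\gamma(t) \equiv \gamma$, for which $\int_0^T \gamma\,dX(t) = \gamma X(T)$. The sketch below is an illustration only: the constant drift, the parameter values, the sample size and the name `cameron_martin_moments` are all assumptions made for the example. It estimates, under $P$, the total mass of $Q$ and the first two moments of $Y(T) = X(T) - \gamma T$ under $Q$, which should be close to $1$, $0$ and $\sigma^2 T$.

```python
import math
import random

def cameron_martin_moments(gamma=0.5, sigma=1.0, T=1.0, n=100000, seed=2):
    """Estimate E_P[Z], E_P[Z Y(T)], E_P[Z Y(T)^2] with Z = dQ/dP of (11.12)."""
    rng = random.Random(seed)
    sum_z = sum_zy = sum_zy2 = 0.0
    for _ in range(n):
        x_T = rng.gauss(0.0, sigma * math.sqrt(T))   # X(T) under P
        # Density (11.12) for a constant drift gamma.
        z = math.exp(gamma * x_T / sigma ** 2 - gamma ** 2 * T / (2.0 * sigma ** 2))
        y_T = x_T - gamma * T                        # Y(T) = X(T) - int_0^T gamma ds
        sum_z += z
        sum_zy += z * y_T
        sum_zy2 += z * y_T ** 2
    return sum_z / n, sum_zy / n, sum_zy2 / n

mass, mean_q, second_moment_q = cameron_martin_moments()
```

Only the marginal at time $T$ is tested here; the theorem itself is a statement about the whole path on $[0,T]$.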


11.5 Fractal Brownian Motion 


The Wiener process $\{W(t)\}_{t\ge 0}$ has the following property. If $c$ is a positive constant, the process $\{W_c(t)\}_{t\ge 0} := \{c^{-1/2}W(ct)\}_{t\ge 0}$ is also a Wiener process. It is indeed a centered Gaussian process with independent increments, null at the time origin, and for $0 \le a \le b$,

$$E\big[|W_c(b) - W_c(a)|^2\big] = c^{-1}E\big[|W(cb) - W(ca)|^2\big] = c^{-1}(cb - ca) = b - a.$$

This is a particular instance of a self-similar stochastic process.


Definition 11.5.1 A real-valued stochastic process $\{Y(t)\}_{t\ge 0}$ is called self-similar with (Hurst) self-similarity parameter $H$ if for any $c > 0$,

$$\{Y(t)\}_{t\ge 0} \stackrel{\mathcal{D}}{=} \{c^{-H}Y(ct)\}_{t\ge 0},$$

where the symbol $\stackrel{\mathcal{D}}{=}$ means "have the same distribution", or "have the same finite-dimensional distribution", depending on the context.

The Wiener process is therefore self-similar with similarity parameter $H = \frac{1}{2}$.


It follows from the definition that $Y(t) \stackrel{\mathcal{D}}{=} t^{H}Y(1)$, and therefore, if $P(Y(1) \neq 0) > 0$:


If $H < 0$, $Y(t) \to 0$ in distribution as $t \to \infty$ and $|Y(t)| \to \infty$ in distribution as $t \to 0$.


If $H > 0$, $|Y(t)| \to \infty$ in distribution as $t \to \infty$ and $Y(t) \to 0$ in distribution as $t \to 0$.


If $H = 0$, $Y(t)$ has a distribution independent of $t$.


In particular, when $H \neq 0$, a self-similar process cannot be stationary (strictly or in the wide sense).


We shall be interested in self-similar processes that have stationary increments. 


Theorem 11.5.2 Let $\{Y(t)\}_{t\ge 0}$ be a non-negative self-similar stochastic process with stationary increments and self-similarity parameter $H > 0$ (in particular, $Y(0) = 0$). Its covariance function is given by

$$\Gamma(t,s) = \mathrm{cov}(Y(t), Y(s)) = \frac{1}{2}\sigma^2\big[t^{2H} - |t-s|^{2H} + s^{2H}\big],$$

where $\sigma^2 = E[(Y(t+1) - Y(t))^2] = E[Y(1)^2]$.


Proof. Assume without loss of generality that the process is centered. Let $0 \le s \le t$. Then

$$E\big[(Y(t) - Y(s))^2\big] = E\big[(Y(t-s) - Y(0))^2\big] = E\big[(Y(t-s))^2\big] = \sigma^2(t-s)^{2H}$$

and

$$2E[Y(t)Y(s)] = E[Y(t)^2] + E[Y(s)^2] - E\big[(Y(t) - Y(s))^2\big],$$

hence the result.
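The covariance formula of Theorem 11.5.2 is simple to evaluate, and two properties can be read off directly: for $H = \frac{1}{2}$ it reduces to $\sigma^2\min(t,s)$, the covariance of the Wiener process, and it is homogeneous of degree $2H$ in $(t,s)$, as self-similarity requires. A minimal sketch (the function name `fbm_covariance` and the sample values are illustrative assumptions):

```python
def fbm_covariance(t, s, hurst, sigma2=1.0):
    """(sigma^2 / 2) * (t^{2H} + s^{2H} - |t - s|^{2H}) for t, s >= 0."""
    two_h = 2.0 * hurst
    return 0.5 * sigma2 * (t ** two_h + s ** two_h - abs(t - s) ** two_h)

# H = 1/2 gives the Wiener covariance min(t, s).
wiener_case = fbm_covariance(2.0, 0.7, 0.5)

# Homogeneity of degree 2H: Gamma(ct, cs) = c^{2H} Gamma(t, s).
c, hurst = 3.0, 0.3
scaled = fbm_covariance(c * 2.0, c * 0.7, hurst)
rescaled = c ** (2.0 * hurst) * fbm_covariance(2.0, 0.7, hurst)
```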


The fractal Brownian motion is a Gaussian process that in a sense generalizes 
the Wiener process. 


Definition 11.5.3 A fractal Brownian motion on $\mathbb{R}_+$ with Hurst parameter $H \in (0,1)$ is a centered Gaussian process $\{B_H(t)\}_{t\ge 0}$ with continuous paths such that $B_H(0) = 0$, and with covariance function

$$E[B_H(t)B_H(s)] = \frac{1}{2}\big(|t|^{2H} + |s|^{2H} - |t-s|^{2H}\big). \qquad (11.15)$$

We shall prove the existence of the fractal Brownian motion by constructing it as a Wiener integral. More precisely, define for $0 < H < 1$, $w_H(t,s) := 0$ for $t \le s$,

$$w_H(t,s) := (t-s)^{H-\frac{1}{2}} \quad\text{for } 0 \le s \le t$$

and

$$w_H(t,s) := (t-s)^{H-\frac{1}{2}} - (-s)^{H-\frac{1}{2}} \quad\text{for } s < 0.$$

Observe that for any $c > 0$

$$w_H(ct,s) = c^{H-\frac{1}{2}}\,w_H(t, sc^{-1}).$$

Define

$$B_H(t) := \int_{\mathbb{R}} w_H(t,s)\,dW(s).$$

The Wiener integral of the right-hand side is, more explicitly,

$$A + B := \int_0^t (t-s)^{H-\frac{1}{2}}\,dW(s) + \int_{-\infty}^0 \big((t-s)^{H-\frac{1}{2}} - (-s)^{H-\frac{1}{2}}\big)\,dW(s). \qquad (11.16)$$

It is well defined. Replacing $t$ by $ct$ and using the change of variable $u = c^{-1}s$, $B_H(ct)$ becomes

$$c^{H-\frac{1}{2}}\int_{\mathbb{R}} w_H(t,u)\,dW(cu).$$

Using the self-similarity of the Brownian motion, the process defined by the last display has the same distribution as the process defined by

$$c^{H-\frac{1}{2}}\,c^{\frac{1}{2}}\int_{\mathbb{R}} w_H(t,u)\,dW(u).$$

Therefore $\{B_H(t)\}_{t\ge 0}$ is self-similar with similarity parameter $H$.
It is tempting to rewrite (11.16) as $Z(t) - Z(0)$, where

$$Z(t) = \int_{-\infty}^{t} (t-s)^{H-\frac{1}{2}}\,dW(s).$$

However this last integral is not well defined as a Doob integral since for all $H > 0$, the function $s \mapsto (t-s)^{H-\frac{1}{2}}1_{\{s\le t\}}$ is not in $L^2_{\mathbb{R}}(\mathbb{R})$.


11.6 Exercises


Exercise 11.6.1. WIDE-SENSE STATIONARY, BUT NOT STATIONARY 

Give a simple example of a discrete-time stochastic process that is wide-sense 
stationary, but not strictly stationary. Do the same for a continuous-time wide- 
sense stationary process. 


Exercise 11.6.2. CLOSE RELATIVES OF THE BROWNIAN MOTION
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. What can you say about the process $\{X(t)\}_{t\in[0,1]}$, where:


A. $X(t) = tW\big(\frac{1}{t}\big)$ with $X(0) := 0$? (You will admit continuity of the process at time 0.)


B. $X(t) = W(1) - W(1-t)$?


Exercise 11.6.3. SQUARED BROWNIAN MOTION
1. Show that for a Brownian motion $\{W(t)\}_{t\ge 0}$,

$$E\big[|W(t) - W(s)|^4\big] = 3|t-s|^2.$$

2. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wide-sense stationary Gaussian process with covariance function $C_X$. Compute the probability that $X(t_1) > X(t_2)$ where $t_1, t_2 \in \mathbb{R}$ are fixed times.


3. Give the mean function and the covariance function of the process $\{X(t)^2\}_{t\in\mathbb{R}}$.


Exercise 11.6.4. CONTINUITY OF THE COVARIANCE FUNCTION

Prove that for the covariance function of a complex wide-sense stationary process $\{X(t)\}_{t\in\mathbb{R}}$ to be continuous, it suffices that it be continuous at the origin, and that this is in turn equivalent to continuity in the quadratic mean of the stochastic process, that is, for all $t \in \mathbb{R}$,

$$\lim_{h\to 0} E\big[|X(t+h) - X(t)|^2\big] = 0.$$

Show that in fact, the covariance function is then uniformly continuous on $\mathbb{R}$.


Exercise 11.6.5. A BASIC FORMULA
Let $\{W(t)\}_{t\ge 0}$ be a standard Wiener process. Prove that for $s,t \in \mathbb{R}_+$,

$$E[W(t)W(s)] = t\wedge s.$$

Let $\{X(t)\}_{t\in[0,1]}$ be a Brownian bridge. Prove that

$$\mathrm{cov}(X(t), X(s)) = s(1-t) \qquad (0 \le s \le t \le 1).$$


Exercise 11.6.6. A REPRESENTATION OF THE BROWNIAN BRIDGE
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Let for $t \in [0,1)$,

$$Y(t) := (1-t)\int_0^t \frac{dW(s)}{1-s}.$$

(i) Prove that the integral in the right-hand side is well defined on $[0,1)$ as a Wiener integral.


(ii) Prove that as $t \downarrow 0$, $Y(t) \to 0$ in quadratic mean.
(iii) Define $Y(0) := 0$. Show that $\{Y(t)\}_{t\in[0,1)}$ is a Gaussian process.


(iv) Show that $\{Y(t)\}_{t\in[0,1)}$ is (has the same distribution as) a Brownian bridge.


Exercise 11.6.7. BROWNIAN BRIDGE
Let $\{W(t)\}_{t\in[0,1]}$ be a Wiener process. Show that the Brownian bridge

$$\{X(t) := W(t) - tW(1)\}_{t\in[0,1]}$$

is a Gaussian process independent of $W(1)$ and compute its autocovariance function. Show that the process $\{X(1-t)\}_{t\in[0,1]}$ is a Brownian bridge.


Exercise 11.6.8. STRUCTURAL MEASURE

Let $\{Z(t)\}_{t\in\mathbb{R}}$ be a second-order real-valued centered stochastic process, right-continuous in the quadratic mean, such that $Z(0) = 0$ and with uncorrelated increments (for all $a \le b \le c \le d$, we have that $E[(Z(b) - Z(a))(Z(d) - Z(c))] = 0$). Show that there exists a locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$$E\big[(Z(b) - Z(a))^2\big] = \mu((a,b]).$$


Exercise 11.6.9. SOME GAUSS–MARKOV PROCESSES
A. Show that the Wiener process is a Gauss–Markov process.


B. Show that a discrete-time stochastic process $\{X_n\}_{n\ge 0}$ defined by $X_{n+1} = aX_n + \varepsilon_{n+1}$ ($n \ge 0$), where $\{\varepsilon_n\}_{n\ge 1}$ is an IID centered Gaussian sequence and $X_0$ is a Gaussian random variable independent of this sequence, is a Gauss–Markov process.


C. For each $t \ge 0$, let $X(t) = a(t)W(\tau(t))$, where $\{W(t)\}_{t\ge 0}$ is a standard Wiener process, $a : \mathbb{R}_+ \to \mathbb{R}$ and $\tau : \mathbb{R}_+ \to \mathbb{R}_+$ are measurable functions, and moreover $\tau$ is strictly increasing, with $\tau(0) = 0$. Prove that $\{X(t)\}_{t\ge 0}$ is a Gauss–Markov stochastic process and give explicitly the functions $f$ and $g$ of Theorem 11.2.12.


Exercise 11.6.10. THE ORNSTEIN–UHLENBECK PROCESS.
Let

$$X(t) := e^{-\alpha t}\,W(e^{2\alpha t}) \qquad (t \ge 0),$$

where $\{W(t)\}_{t\ge 0}$ is a standard Wiener process and $\alpha$ is a positive real number. Prove that $\{X(t)\}_{t\ge 0}$ is an Ornstein–Uhlenbeck process.


Exercise 11.6.11. ORNSTEIN–UHLENBECK IS GAUSS–MARKOV
Show that the Ornstein–Uhlenbeck process is a Gauss–Markov process. Describe the functions $f$ and $\tau$ in its representation as

$$X(t) = f(t)\,W(\tau(t)).$$


Exercise 11.6.12. MICROPULSES AND FRACTAL BROWNIAN MOTION.

Let $N_\varepsilon$ be a Poisson process on $\mathbb{R} \times \mathbb{R}_+$ with the mean measure $\nu(dt \times dz) = \frac{1}{2\varepsilon^2}z^{-1-\theta}\,dt \times dz$, where $0 < \theta < 1$ and $\varepsilon > 0$. For all $t \ge 0$, let $S_t^+ = \{(s,z) : 0 < s < t,\ t-s < z\}$ and $S_t^- = \{(s,z) : -\infty < s < 0,\ -s < z < t-s\}$, and define⁴

$$X_\varepsilon(t) = \varepsilon\,\big\{N_\varepsilon(S_t^+) - N_\varepsilon(S_t^-)\big\}.$$

(1) Show that $X_\varepsilon(t)$ is well defined for all $t \ge 0$.


(2) Compute for all $0 \le t_1 < t_2 < \cdots < t_n$ the characteristic function of $(X_\varepsilon(t_1),\dots,X_\varepsilon(t_n))$.

(3) Show that for all $0 \le t_1 < t_2 < \cdots < t_n$, $(X_\varepsilon(t_1),\dots,X_\varepsilon(t_n))$ converges in distribution to $(B_H(t_1),\dots,B_H(t_n))$ as $\varepsilon \downarrow 0$, where $\{B_H(t)\}_{t\ge 0}$ is a fractal Brownian motion (fBm) with Hurst parameter $H = \frac{1-\theta}{2}$ and variance $E[B_H(1)^2] = \theta^{-1}(1-\theta)^{-1}$. Recall that $\{B_H(t)\}_{t\ge 0}$ is called an fBm with Hurst parameter $H$, $0 < H < \frac{1}{2}$, if it is a centered Gaussian process such that $B_H(0) = 0$ with covariance function

$$E[B_H(t)B_H(s)] = \frac{1}{2}\big(|s|^{2H} + |t|^{2H} - |s-t|^{2H}\big)\,E[B_H(1)^2].$$

⁴ R. Cioczek-Georges and B.B. Mandelbrot, A class of micropulses and antipersistent fractal Brownian motion, Stochastic Processes and their Applications, 60, pp. 1-18, (1995).




Chapter 12 


Wide-sense Stationary Processes 


This chapter concerns a topic of interest in many fields of application, most notably 
signal processing and communications theory, as well as econometrics and the 
earth sciences. The main notion here is that of power spectrum (power spectral 
measure). 


12.1 The Power Spectral Measure 


As we shall now see, the classical Fourier analysis of square-integrable (with respect 
to Lebesgue measure) functions has a counterpart in the theory of wide-sense 
stationary processes. 


Consider first a WSS stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ with integrable and continuous covariance function $C$. The Fourier transform $f$ of this covariance function is therefore well defined by

$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\,C(\tau)\,d\tau. \qquad (12.1)$$

It is called the power spectral density (PSD). It turns out that it is non-negative and integrable, as we shall soon see. Since it is integrable, the Fourier inversion formula

$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,f(\nu)\,d\nu \qquad (12.2)$$

holds almost everywhere, and in fact everywhere since both sides of the equality are continuous (Example 4.1.24). In the context of WSS stochastic processes, (12.2) is called the Bochner formula. Letting $\tau = 0$ in this formula, we obtain, since $C(0) = \mathrm{Var}(X(t)) := \sigma^2$,

$$\sigma^2 = \int_{\mathbb{R}} f(\nu)\,d\nu. \qquad (12.3)$$

¹ See for instance [5], Theorem 1.1.8.


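EXAMPLE 12.1.1: THE ORNSTEIN–UHLENBECK PROCESS. Let $\{X(t)\}_{t\ge 0}$ be an Ornstein–Uhlenbeck process. It is a centered Gaussian process, and, using (11.6), we have for $t \ge s$,

$$
\begin{aligned}
E[X(t)X(s)] &= E\big[e^{-\alpha t}W(e^{2\alpha t})\,e^{-\alpha s}W(e^{2\alpha s})\big] \\
&= e^{-\alpha(t+s)}\,E\big[W(e^{2\alpha t})W(e^{2\alpha s})\big] \\
&= e^{-\alpha(t+s)}\,\min(e^{2\alpha t}, e^{2\alpha s}) \\
&= e^{-\alpha(t+s)}\,e^{2\alpha s} = e^{-\alpha(t-s)},
\end{aligned}
$$

and therefore, for all $s,t \in \mathbb{R}_+$,

$$E[X(t)X(s)] = e^{-\alpha|t-s|}.$$

It is therefore a WSS stochastic process with integrable covariance function, and its power spectral density is then the Fourier transform of the covariance function:

$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\,e^{-\alpha|\tau|}\,d\tau = \frac{2\alpha}{\alpha^2 + 4\pi^2\nu^2}.$$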
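The closed form of the power spectral density in Example 12.1.1 can be compared with a direct numerical evaluation of the Fourier integral. The following sketch is an illustration only; the truncation point `tau_max`, the grid size and the function names are assumptions chosen for the example. Since the covariance is even, the integral equals $2\int_0^\infty \cos(2\pi\nu\tau)\,e^{-\alpha\tau}\,d\tau$, which is approximated by the trapezoidal rule.

```python
import math

def ou_psd_closed_form(nu, alpha):
    """2 alpha / (alpha^2 + 4 pi^2 nu^2), as in Example 12.1.1."""
    return 2.0 * alpha / (alpha ** 2 + 4.0 * math.pi ** 2 * nu ** 2)

def ou_psd_numeric(nu, alpha, tau_max=40.0, n=200000):
    """Trapezoidal approximation of the Fourier transform of exp(-alpha |tau|)."""
    step = tau_max / n

    def integrand(tau):
        return math.cos(2.0 * math.pi * nu * tau) * math.exp(-alpha * tau)

    total = 0.5 * (integrand(0.0) + integrand(tau_max))
    for k in range(1, n):
        total += integrand(k * step)
    return 2.0 * step * total   # factor 2: the integrand is even in tau

difference = abs(ou_psd_numeric(0.3, 1.0) - ou_psd_closed_form(0.3, 1.0))
```

For small $\alpha$ the truncation point must be increased, since the neglected tail is of order $e^{-\alpha\,\tau_{\max}}$.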


Not all WSS stochastic processes admit a power spectral density. For instance, consider a wide-sense stationary process with a covariance function of the form

$$C(\tau) = \sum_{k\in\mathbb{Z}} P_k\,e^{2i\pi\nu_k\tau}, \qquad (12.4)$$

where

$$P_k \ge 0 \quad\text{and}\quad \sum_{k\in\mathbb{Z}} P_k < \infty \qquad (12.5)$$

(say, the harmonic process of Example 11.1.14). Clearly, this covariance function is not integrable, and in fact there does not exist a power spectral density. In particular, a representation of the covariance function such as (12.2) is not available, at least if the function $f$ is interpreted in the ordinary sense. However, there is a formula such as (12.2) if we consent to define the PSD in this case to be the pseudo-function

$$f(\nu) = \sum_{k\in\mathbb{Z}} P_k\,\delta(\nu - \nu_k), \qquad (12.6)$$

where $\delta(\nu - a)$ is the delayed Dirac pseudo-function informally defined by

$$\int_{\mathbb{R}} \varphi(\nu)\,\delta(\nu - a)\,d\nu = \varphi(a).$$

Indeed, with such a convention,

$$\int_{\mathbb{R}} e^{2i\pi\nu\tau}\,f(\nu)\,d\nu = \sum_{k\in\mathbb{Z}} P_k\int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\delta(\nu - \nu_k)\,d\nu = \sum_{k\in\mathbb{Z}} P_k\,e^{2i\pi\nu_k\tau}.$$


The General Case


Remember that the characteristic function $\varphi$ of a real random variable $X$ has the following properties:


A. it is hermitian symmetric, that is, $\varphi(-u) = \varphi(u)^*$, and it is uniformly bounded: $|\varphi(u)| \le \varphi(0)$,


B. it is uniformly continuous on $\mathbb{R}$, and


C. it is definite non-negative, in the sense that for all integers $n$, all $u_1,\dots,u_n \in \mathbb{R}$, and all $z_1,\dots,z_n \in \mathbb{C}$,

$$\sum_{j=1}^{n}\sum_{k=1}^{n} \varphi(u_j - u_k)\,z_j z_k^* \ge 0$$

(just observe that the left-hand side equals $E\big[\big|\sum_{j=1}^{n} z_j e^{iu_jX}\big|^2\big]$).


It turns out that Properties A, B and C characterize characteristic functions up to a multiplicative constant. This is the content of Bochner's theorem (Theorem 7.1.7), which is now recalled for easier reference:


Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C. Then there exists a constant $0 \le \beta < \infty$ and a real random variable $X$ such that for all $u \in \mathbb{R}$,

$$\varphi(u) = \beta\,E\big[e^{iuX}\big].$$

Bochner's theorem is all that is needed to define the power spectral measure of a wide-sense stationary stochastic process continuous in the quadratic mean.


Theorem 12.1.2 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a WSS stochastic process continuous in the quadratic mean, with covariance function $C$. Then, there exists a unique measure $\mu$ on $\mathbb{R}$ such that

$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\mu(d\nu). \qquad (12.7)$$

In particular, $\mu$ is a finite measure:

$$\mu(\mathbb{R}) = C(0) = \mathrm{Var}(X(0)) < \infty. \qquad (12.8)$$

Proof. It suffices to observe that the covariance function of a WSS stochastic process that is continuous in the quadratic mean shares the properties A, B and C of the characteristic function of a real random variable. Indeed,


(a) it is hermitian symmetric, and $|C(\tau)| \le C(0)$ (Schwarz's inequality),
(b) it is uniformly continuous, and


(c) it is definite non-negative, in the sense that for all integers $n$, all $\tau_1,\dots,\tau_n \in \mathbb{R}$, and all $z_1,\dots,z_n \in \mathbb{C}$,

$$\sum_{j=1}^{n}\sum_{k=1}^{n} C(\tau_j - \tau_k)\,z_j z_k^* \ge 0$$

(for a centered process, just observe that the left-hand side is equal to $E\big[\big|\sum_{j=1}^{n} z_j X(\tau_j)\big|^2\big]$).


Therefore, by Theorem 7.1.7, the covariance function $C$ is up to a multiplicative constant a characteristic function. This is exactly what (12.7) says, since $\mu$ thereof is a finite measure, that is, up to a multiplicative constant, a probability distribution.


Uniqueness of the power spectral measure follows from the fact that a finite measure (up to a multiplicative constant: a probability) on $\mathbb{R}$ is characterized by its Fourier transform (Theorem 5.3.2).


Special Cases 


The case of an absolutely continuous spectrum corresponds to the situation where $\mu$ admits a density $f$ with respect to Lebesgue measure: $\mu(d\nu) = f(\nu)\,d\nu$. (In particular, $f$ is non-negative and integrable with respect to Lebesgue measure.) As we saw before, we then say that the WSS stochastic process in question admits the power spectral density (PSD) $f$.


The case of a "line spectrum" corresponds to a spectral measure that is a weighted sum of Dirac measures:

$$\mu(d\nu) = \sum_{k\in\mathbb{Z}} P_k\,\varepsilon_{\nu_k}(d\nu).$$

Since $\mu$ is a measure, the $P_k$'s are non-negative, and since $\mu$ is a finite measure, they have a finite sum.


12.2 Filtering of WSS Stochastic Processes


We recall a few standard results concerning the (convolutional) filtering of deter- 
ministic functions. 


Let $f, g : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable functions with respective Fourier transforms $\hat f$ and $\hat g$. Then (Exercise 4.5.12),

$$\int_{\mathbb{R}}\int_{\mathbb{R}} |f(t-s)\,g(s)|\,dt\,ds < \infty,$$

and therefore, for almost all $t \in \mathbb{R}$, the function $s \mapsto f(t-s)g(s)$ is Lebesgue integrable. In particular, the convolution

$$(f * g)(t) := \int_{\mathbb{R}} f(t-s)\,g(s)\,ds$$

is almost everywhere well defined. For all $t$ such that the last integral is not defined, set $(f * g)(t) = 0$. Then $f * g$ is Lebesgue integrable and its Fourier transform is $\widehat{f * g} = \hat f\,\hat g$, where $\hat f$, $\hat g$ are the Fourier transforms of $f$ and $g$, respectively (Exercise 4.5.13).


Let $h : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be an integrable function. The operation that associates to the integrable function $x : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ the integrable function

$$y(t) = \int_{\mathbb{R}} h(t-s)\,x(s)\,ds$$

is called a stable convolutional filter. The function $h$ is called the impulse response of the filter, and $x$ and $y$ are respectively the input and the output of this filter. The Fourier transform $\hat h$ of the impulse response is the transmittance of the filter.


Let now $\{X(t)\}_{t\in\mathbb{R}}$ be a WSS stochastic process with continuous covariance function $C_X$. We examine the effect of filtering on this process. The output process is the process defined by

$$Y(t) := \int_{\mathbb{R}} h(t-s)\,X(s)\,ds. \qquad (12.9)$$

Note that the integral (12.9) is well defined under the integrability condition for the impulse response $h$. This follows from Theorem 11.1.10 according to which the integral

$$\int_{\mathbb{R}} f(s)\,X(s,\omega)\,ds$$

is well defined for $P$-almost all $\omega$ when $f$ is integrable (in the special case of WSS stochastic processes, $m(t) = m$ and $\Gamma(t,t) = C(0) + |m|^2$, and therefore the conditions on $f$ and $g$ thereof reduce to integrability of these functions). Referring to the same theorem, we have

$$E\Big[\int_{\mathbb{R}} f(t)\,X(t)\,dt\Big] = \int_{\mathbb{R}} f(t)\,E[X(t)]\,dt = m\int_{\mathbb{R}} f(t)\,dt. \qquad (12.10)$$

Let now $f, g : \mathbb{R} \to \mathbb{C}$ be integrable functions. As a special case of Theorem 11.1.10, we have

$$\mathrm{cov}\Big(\int_{\mathbb{R}} f(t)\,X(t)\,dt,\ \int_{\mathbb{R}} g(s)\,X(s)\,ds\Big) = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)\,g^*(s)\,C(t-s)\,dt\,ds. \qquad (12.11)$$

We shall see that, in addition,

$$\mathrm{cov}\Big(\int_{\mathbb{R}} f(t)\,X(t)\,dt,\ \int_{\mathbb{R}} g(s)\,X(s)\,ds\Big) = \int_{\mathbb{R}} \hat f(-\nu)\,\hat g^*(-\nu)\,\mu(d\nu). \qquad (12.12)$$


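Proof. Assume without loss of generality that $m = 0$. From Bochner's representation of the covariance function, we obtain for the last double integral in (12.11)

$$\int_{\mathbb{R}}\int_{\mathbb{R}} f(t)\,g^*(s)\Big(\int_{\mathbb{R}} e^{2i\pi\nu(t-s)}\,\mu(d\nu)\Big)dt\,ds = \int_{\mathbb{R}}\Big(\int_{\mathbb{R}} f(t)\,e^{2i\pi\nu t}\,dt\Big)\Big(\int_{\mathbb{R}} g(s)\,e^{2i\pi\nu s}\,ds\Big)^*\mu(d\nu).$$

Here again we have to justify the change of order of integration using Fubini's theorem. For this, it suffices to show that the function

$$(t,s,\nu) \mapsto \big|f(t)\,g^*(s)\,e^{2i\pi\nu(t-s)}\big| = |f(t)|\,|g(s)|\,1_{\mathbb{R}}(\nu)$$

is integrable with respect to the product measure $\ell \times \ell \times \mu$. This is indeed true, the integral being equal to $\big(\int_{\mathbb{R}} |f(t)|\,dt\big) \times \big(\int_{\mathbb{R}} |g(t)|\,dt\big) \times \mu(\mathbb{R})$.


In view of the above results, the right-hand side of formula (12.9) is well defined. Moreover: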


Theorem 12.2.1 When the input process $\{X(t)\}_{t\in\mathbb{R}}$ is a WSS stochastic process with power spectral measure $\mu_X$, the output $\{Y(t)\}_{t\in\mathbb{R}}$ of a stable convolutional filter of transmittance $\hat h$ is a WSS stochastic process with the power spectral measure

$$\mu_Y(d\nu) = |\hat h(\nu)|^2\,\mu_X(d\nu). \qquad (12.13)$$

This formula will be referred to as the fundamental filtering formula.

Proof. Just apply formulas (12.10) and (12.12) with the functions

$$f(u) = h(t-u), \qquad g(v) = h(s-v)$$

to obtain

$$E[Y(t)] = m\int_{\mathbb{R}} h(t)\,dt =: m_Y,$$

and

$$E\big[(Y(t) - m_Y)(Y(s) - m_Y)^*\big] = \int_{\mathbb{R}} |\hat h(\nu)|^2\,e^{2i\pi\nu(t-s)}\,\mu_X(d\nu).$$


EXAMPLE 12.2.2: TWO SPECIAL CASES. In particular, if the input process admits a PSD $f_X$, the output process also admits a PSD given by

$$f_Y(\nu) = |\hat h(\nu)|^2\,f_X(\nu).$$

When the input process has a line spectrum, the power spectral measure of the output process takes the form

$$\mu_Y(d\nu) = \sum_{k=1}^{\infty} P_k\,|\hat h(\nu_k)|^2\,\varepsilon_{\nu_k}(d\nu).$$
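As a worked illustration of the fundamental filtering formula in the PSD case, take the Ornstein–Uhlenbeck input of Example 12.1.1, $f_X(\nu) = 2\alpha/(\alpha^2 + 4\pi^2\nu^2)$, and a first-order low-pass filter. The impulse response $h(t) = \beta e^{-\beta t}1_{\{t\ge 0\}}$ is a hypothetical choice made for this example (it does not come from the text); its transmittance satisfies $|\hat h(\nu)|^2 = \beta^2/(\beta^2 + 4\pi^2\nu^2)$. The output power $\int_{\mathbb{R}} f_Y(\nu)\,d\nu$ then has the closed form $\beta/(\alpha+\beta)$, which the sketch compares with a trapezoidal quadrature over a truncated frequency range.

```python
import math

def output_power_numeric(alpha, beta, nu_max=50.0, n=200000):
    """Trapezoidal approximation of int |h_hat(nu)|^2 f_X(nu) d nu over [-nu_max, nu_max]."""
    def f_y(nu):
        w2 = (2.0 * math.pi * nu) ** 2
        gain = beta ** 2 / (beta ** 2 + w2)      # |h_hat(nu)|^2 for the assumed filter
        f_x = 2.0 * alpha / (alpha ** 2 + w2)    # input PSD of Example 12.1.1
        return gain * f_x

    step = nu_max / n
    total = 0.5 * (f_y(0.0) + f_y(nu_max))
    for k in range(1, n):
        total += f_y(k * step)
    return 2.0 * step * total                    # f_Y is even in nu

def output_power_closed_form(alpha, beta):
    return beta / (alpha + beta)

power = output_power_numeric(1.0, 2.0)
```

With $\alpha = 1$ and $\beta = 2$ the output variance is $2/3$ of the input variance $C(0) = 1$; letting $\beta \to \infty$ (a filter that passes everything) recovers the full input power.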


White Noise 


By analogy with Optics, one calls white noise any centered WSS stochastic process $\{B(t)\}_{t\in\mathbb{R}}$ with constant power spectral density $f_B(\nu) = 1$. Such a definition presents a theoretical difficulty, because

$$\int_{-\infty}^{+\infty} f_B(\nu)\,d\nu = +\infty,$$

which contradicts the finite power property of wide-sense stationary processes. We have therefore to find other ways to deal with white noise.


Heuristics I: The Large Flat Spectrum Approach 


From a pragmatic point of view, one could define a white noise to be a centered 
WSS stochastic process whose PSD is constant over a “large”, yet bounded, range 
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of frequencies [—A,+A]. The calculations below show what happens as A tends 
to infinity. Let therefore {X(t)},<p be a centered wss stochastic process with PSD 


fv) = lea +a) - 


Let 1, ~2 : R > C be two functions in L4,(R)NL2(R) with Fourier transforms 
(1 and G2, respectively. Then 


am B[(fetoxear)(f aoxwar) |= [ amewa 


- | B(v)B4(v) dv. 


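Proof. We have
$$E\left[\left(\int_{\mathbb{R}} \varphi_1(t)X(t)\,dt\right)\left(\int_{\mathbb{R}} \varphi_2(t)X(t)\,dt\right)^*\right] = \int_{\mathbb{R}}\int_{\mathbb{R}} \varphi_1(u)\varphi_2(v)^*\,C_X(u-v)\,du\,dv.$$

The latter quantity is equal to
$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \varphi_1(u)\varphi_2(v)^*\left(\int_{-A}^{+A} e^{2i\pi\nu(u-v)}\,d\nu\right)du\,dv$$
$$= \int_{-A}^{+A}\left(\int_{-\infty}^{+\infty} \varphi_1(u)e^{2i\pi\nu u}\,du\right)\left(\int_{-\infty}^{+\infty} \varphi_2(v)^* e^{-2i\pi\nu v}\,dv\right)d\nu$$
$$= \int_{-A}^{+A} \hat\varphi_1(-\nu)\hat\varphi_2(-\nu)^*\,d\nu,$$
and the limit of this quantity as $A \uparrow \infty$ is:
$$\int_{-\infty}^{+\infty} \hat\varphi_1(\nu)\hat\varphi_2(\nu)^*\,d\nu = \int_{-\infty}^{+\infty} \varphi_1(t)\varphi_2(t)^*\,dt,$$
where the last equality is the Plancherel-Parseval identity.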
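The convergence as $A$ grows can be checked in closed form on one example. This is a sketch under an assumption that is not in the text: take $\varphi_1 = \varphi_2 = \varphi$ with $\varphi(t) = e^{-\pi t^2}$, which is its own Fourier transform, so that the truncated spectral integral is an error function.

```python
import math

def truncated_spectral_integral(a: float) -> float:
    """Integral over [-a, a] of exp(-2*pi*nu^2), that is of |phi_hat(nu)|^2
    for the (assumed) test function phi(t) = exp(-pi*t^2)."""
    return math.erf(a * math.sqrt(2.0 * math.pi)) / math.sqrt(2.0)

# Integral over R of |phi(t)|^2 = exp(-2*pi*t^2), the Plancherel-Parseval limit.
time_domain_energy = 1.0 / math.sqrt(2.0)

for a in (0.25, 0.5, 1.0, 2.0):
    gap = time_domain_energy - truncated_spectral_integral(a)
    print(f"A = {a:4.2f}  truncated integral = "
          f"{truncated_spectral_integral(a):.12f}  gap = {gap:.3e}")
```

For this rapidly decaying test function the gap is already negligible for moderate $A$, which is the pragmatic content of the large flat spectrum heuristic.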


Let now $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$, and define
$$Y(t) = \int_{\mathbb{R}} h(t-s)X(s)\,ds.$$

Applying the above result with $\varphi_1(u) = h(t+\tau-u)$ and $\varphi_2(v) = h(t-v)$, we find that the covariance function $C_Y$ of this WSS stochastic process is such that
$$\lim_{A\uparrow\infty} C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|\hat h(\nu)|^2\,d\nu.$$
The limit is finite since $h \in L^2_{\mathbb{C}}(\mathbb{R})$ (and therefore $\hat h \in L^2_{\mathbb{C}}(\mathbb{R})$) and is a covariance function corresponding to a bona fide (that is, integrable) PSD $f_Y(\nu) = |\hat h(\nu)|^2$. With $f(\nu) = 1$, we formally retrieve the usual filtering formula,
$$f_Y(\nu) = |\hat h(\nu)|^2 f(\nu).$$


Heuristics II: The Approximate Derivative Approach


Here, we consider the white Gaussian noise. The heuristic approach in this case substitutes for $\{B(t)\}_{t\in\mathbb{R}}$ the “finitesimal” derivative of the Brownian motion
$$B_h(t) = \frac{W(t+h) - W(t)}{h}.$$

For fixed $h > 0$ this defines a proper WSS stochastic process, centered, with covariance function
$$C_h(\tau) = \frac{(h - |\tau|)^+}{h^2}$$
and (Exercise 12.5.6) power spectral density
$$f_h(\nu) = \left(\frac{\sin \pi\nu h}{\pi\nu h}\right)^2. \tag{12.14}$$

Note that, as $h \downarrow 0$, the power spectral density tends to the constant function 1, the power spectral density of the “white noise”. At the same time, the covariance function “tends to the Dirac function” and the energy $C_h(0) = \frac{1}{h}$ tends to infinity. This is another feature of white noise: unpredictability. Indeed, for $\tau > h$, the value $B_h(t+\tau)$ cannot be predicted from the value $B_h(t)$, since both are independent random variables.
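The pair formed by the triangular covariance $(h-|\tau|)^+/h^2$ and the squared cardinal sine $(\sin \pi\nu h / \pi\nu h)^2$ can be checked numerically. The following is a minimal sketch, not a proof: the midpoint-rule integrator, its step count and the sample values of $\nu$ and $h$ are arbitrary choices made for illustration.

```python
import math

def cov(tau: float, h: float) -> float:
    """Triangular covariance (h - |tau|)^+ / h^2 of the approximate derivative."""
    return max(h - abs(tau), 0.0) / h ** 2

def psd_closed_form(nu: float, h: float) -> float:
    """Squared cardinal sine (sin(pi nu h) / (pi nu h))^2."""
    x = math.pi * nu * h
    return 1.0 if x == 0.0 else (math.sin(x) / x) ** 2

def psd_numeric(nu: float, h: float, n: int = 20000) -> float:
    """Midpoint-rule approximation of the Fourier transform of the covariance.
    The covariance is even, so only the cosine part contributes."""
    step = 2.0 * h / n
    total = 0.0
    for k in range(n):
        tau = -h + (k + 0.5) * step
        total += cov(tau, h) * math.cos(2.0 * math.pi * nu * tau)
    return total * step

print(psd_numeric(0.7, 0.5), psd_closed_form(0.7, 0.5))
for h in (1.0, 0.1, 0.01):
    # As h decreases, the PSD at a fixed frequency approaches 1
    # while the energy C_h(0) = 1/h blows up.
    print(h, psd_closed_form(1.0, h), cov(0.0, h))
```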


One then lets
$$\int_{\mathbb{R}} f(t)B(t)\,dt = \lim_{h\downarrow 0} \int_{\mathbb{R}} f(t)B_h(t)\,dt.$$
The Wiener Approach to White Noise 


The third approach to white noise differs from the previous ones, involving 
limits, in that it consists in working right away “at the limit”. 


In this approach, one does not attempt to define the white noise $\{B(t)\}_{t\in\mathbb{R}}$ directly (for good reasons since it does not exist as a bona fide WSS stochastic process, as we noted earlier). Instead, the symbolic integral $\int_{\mathbb{R}} f(t)B(t)\,dt$ is defined, for integrands $f$ to be described below, by
$$\int_{\mathbb{R}} f(t)B(t)\,dt = \int_{\mathbb{R}} f(t)\,dZ(t), \tag{12.15}$$
where $\{Z(t)\}_{t\in\mathbb{R}}$ is a centered stochastic process with uncorrelated increments. One then says that $\{B(t)\}_{t\in\mathbb{R}}$ is a white noise and that $\{Z(t)\}_{t\in\mathbb{R}}$ is an integrated white noise.


When $\{Z(t)\}_{t\in\mathbb{R}} = \{W(t)\}_{t\in\mathbb{R}}$, a standard Brownian motion, $\{B(t)\}_{t\in\mathbb{R}}$ is called a Gaussian white noise.


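In the Gaussian white noise case, we have that for all $f, g \in L^2_{\mathbb{C}}(\mathbb{R})$,
$$E\left[\int_{\mathbb{R}} f(t)B(t)\,dt\right] = 0,$$
and by the isometry formulas for the Doob-Wiener integral,
$$E\left[\left(\int_{\mathbb{R}} f(t)B(t)\,dt\right)\left(\int_{\mathbb{R}} g(t)B(t)\,dt\right)^*\right] = \int_{\mathbb{R}} f(t)g(t)^*\,dt,$$
which can be formally rewritten, using the Dirac symbolism:
$$\int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g(s)^*\,E\left[B(t)B^*(s)\right]dt\,ds = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g(s)^*\,\delta(t-s)\,dt\,ds.$$

Hence “the covariance function of the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is a Dirac pseudo-function: $C_B(\tau) = \delta(\tau)$”.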


Let $\{B(t)\}_{t\in\mathbb{R}}$ be a white noise with structural measure $\ell$, for example the Gaussian white noise. Let $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}} \cap L^2_{\mathbb{C}}$, and define the output of a filter with impulse response $h$ when the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is the input, by
$$Y(t) = \int_{\mathbb{R}} h(t-s)B(s)\,ds.$$
By the isometry formula for the Wiener-Doob integral,
$$E\left[Y(t)Y(s)^*\right] = \int_{\mathbb{R}} h(t-s+u)\,h^*(u)\,du,$$
and therefore (Plancherel-Parseval equality)
$$C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|\hat h(\nu)|^2\,d\nu.$$

The stochastic process $\{Y(t)\}_{t\in\mathbb{R}}$ is therefore centered and WSS, with power spectral density
$$f_Y(\nu) = |\hat h(\nu)|^2 f_B(\nu),$$
where
$$f_B(\nu) = 1.$$


We therefore once more recover formally the fundamental equation of linear filter- 
ing of WSS continuous-time stochastic processes. 


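The connection with the approximate derivative approach is the following: For all $f \in L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$,
$$\lim_{h\downarrow 0} \int_{\mathbb{R}} f(t)B_h(t)\,dt = \int_{\mathbb{R}} f(t)\,dW(t)$$
in the quadratic mean. The proof is omitted.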


12.3. The Cramér—Khinchin Decomposition 


Almost surely, a trajectory of a stationary stochastic process is neither in $L^1_{\mathbb{C}}(\ell)$ nor in $L^2_{\mathbb{C}}(\ell)$, unless it is identically null. The formal argument will not be given here², but the examples show this convincingly. Therefore such a trajectory does not have a Fourier transform in the usual senses. There exists however, in some particular sense, a kind of Fourier spectral decomposition of the trajectories of a WSS stochastic process, as we shall now see.


Theorem 12.3.1 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process, continuous in the quadratic mean, and let $\mu$ be its power spectral measure. There exists a unique (more precision below the theorem) centered stochastic process $\{x(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments and with structural measure $\mu$, such that for all $t \in \mathbb{R}$, P-a.s.,
$$X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu), \tag{12.16}$$
where the integral on the right-hand side is a Doob integral.


The decomposition (12.16) is unique in the following sense: If there exists another centered stochastic process $\{\tilde x(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments, and with finite structural measure $\tilde\mu$, such that for all $t \in \mathbb{R}$, we have P-a.s., $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,d\tilde x(\nu)$, then for all $a, b \in \mathbb{R}$, $a < b$, $\tilde x(b) - \tilde x(a) = x(b) - x(a)$, P-a.s.


We shall say: “$dx(\nu)$ is the (Cramér-Khinchin) spectral decomposition” of the WSS stochastic process.


2 See Remark 12.1.1 of [7]. 
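Proof. 1. Denote by $H(X)$ the vector subspace of $L^2_{\mathbb{C}}(P)$ formed by the finite complex linear combinations of the type
$$Z = \sum_{k=1}^{K} \lambda_k X(t_k),$$
and let us denote by $\varphi$ the mapping of $H(X)$ into $L^2_{\mathbb{C}}(\mu)$ defined by
$$\varphi : Z \mapsto \sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}.$$
Using Bochner’s theorem, we verify that it is a linear isometry of $H(X)$ into $L^2_{\mathbb{C}}(\mu)$:
$$E\left[\left|\sum_{k=1}^{K} \lambda_k X(t_k)\right|^2\right] = \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,E\left[X(t_k)X(t_\ell)^*\right]$$
$$= \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,C(t_k - t_\ell) = \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^* \int_{\mathbb{R}} e^{2i\pi\nu(t_k - t_\ell)}\,\mu(d\nu)$$
$$= \int_{\mathbb{R}} \left(\sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,e^{2i\pi\nu(t_k - t_\ell)}\right)\mu(d\nu) = \int_{\mathbb{R}} \left|\sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}\right|^2 \mu(d\nu).$$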
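The isometry of step 1 can be verified numerically when the spectral measure is a finite sum of point masses, in which case Bochner's representation of the covariance is a finite sum. The sketch below assumes a hypothetical three-line spectrum and arbitrary coefficients; it only compares the two sides of the identity and makes no claim about the general case.

```python
import cmath
import math

# Hypothetical discrete spectral measure: mass P_j at frequency nu_j.
SPECTRUM = [(-0.8, 0.3), (0.1, 1.0), (1.5, 0.7)]

def covariance(tau: float) -> complex:
    """Bochner representation C(tau) = sum_j P_j exp(2 i pi nu_j tau)."""
    return sum(p * cmath.exp(2j * math.pi * nu * tau) for nu, p in SPECTRUM)

def second_moment(lams, times) -> complex:
    """E|sum_k lam_k X(t_k)|^2, expanded with the covariance function."""
    return sum(lk * ll.conjugate() * covariance(tk - tl)
               for lk, tk in zip(lams, times)
               for ll, tl in zip(lams, times))

def squared_norm_in_l2_mu(lams, times) -> float:
    """Integral of |sum_k lam_k exp(2 i pi nu t_k)|^2 against the measure."""
    return sum(p * abs(sum(lk * cmath.exp(2j * math.pi * nu * tk)
                           for lk, tk in zip(lams, times))) ** 2
               for nu, p in SPECTRUM)

lams = [1.0 + 2.0j, -0.5j, 0.25 + 0.0j]   # arbitrary complex coefficients
times = [0.0, 0.3, 1.7]                   # arbitrary time points
lhs = second_moment(lams, times)
rhs = squared_norm_in_l2_mu(lams, times)
print(lhs, rhs)
```

The left-hand side comes out real up to rounding, as it must, since it is the second moment of a random variable.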


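2. This isometric linear mapping can be uniquely extended to an isometric linear mapping (that we shall continue to call $\varphi$) from $\overline{H(X)}$, the closure of $H(X)$, into $L^2_{\mathbb{C}}(\mu)$ (Theorem A.0.6). As the combinations $\sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}$ are dense in $L^2_{\mathbb{C}}(\mu)$ when $\mu$ is a finite measure³, $\varphi$ is onto. Therefore, it is a linear isometric bijection between $\overline{H(X)}$ and $L^2_{\mathbb{C}}(\mu)$.


3. Let $x(\nu_0)$ be the random variable in $\overline{H(X)}$ that corresponds in this isometry to the function $1_{(-\infty,\nu_0]}(\nu)$ of $L^2_{\mathbb{C}}(\mu)$. First, observe that
$$E\left[x(\nu_2) - x(\nu_1)\right] = 0$$
since $\overline{H(X)}$ is the closure in $L^2_{\mathbb{C}}(P)$ of a family of centered random variables. Also, by isometry,
$$E\left[(x(\nu_2) - x(\nu_1))(x(\nu_4) - x(\nu_3))^*\right] = \int_{\mathbb{R}} 1_{(\nu_1,\nu_2]}(\nu)\,1_{(\nu_3,\nu_4]}(\nu)\,\mu(d\nu)$$
$$= \mu\left((\nu_1,\nu_2] \cap (\nu_3,\nu_4]\right).$$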


3 This will be admitted. 
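One can therefore define the Doob integral $\int_{\mathbb{R}} f(\nu)\,dx(\nu)$ for all $f \in L^2_{\mathbb{C}}(\mu)$.


4. Let now
$$Z_n(t) = \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)} \left(x\left(\frac{k+1}{2^n}\right) - x\left(\frac{k}{2^n}\right)\right).$$
We have
$$\lim_{n\uparrow\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$$
(limit in $L^2_{\mathbb{C}}(P)$). In fact,
$$Z_n(t) = \int_{\mathbb{R}} f_n(t,\nu)\,dx(\nu),$$
where
$$f_n(t,\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)}\,1_{(k/2^n,(k+1)/2^n]}(\nu),$$
and therefore, by isometry,
$$E\left[\left|Z_n(t) - \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)\right|^2\right] = \int_{\mathbb{R}} \left|e^{2i\pi\nu t} - f_n(t,\nu)\right|^2 \mu(d\nu),$$
a quantity which tends to zero when $n$ tends to infinity (by dominated convergence, using the fact that $\mu$ is a bounded measure). On the other hand, by definition of $\varphi$,
$$Z_n(t) \xrightarrow{\ \varphi\ } f_n(t,\nu).$$
Since, for fixed $t$, $\lim_{n\uparrow\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$ in $L^2_{\mathbb{C}}(P)$ and $\lim_{n\uparrow\infty} f_n(t,\nu) = e^{2i\pi\nu t}$ in $L^2_{\mathbb{C}}(\mu)$,
$$\int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) \xrightarrow{\ \varphi\ } e^{2i\pi\nu t}.$$
But, by definition of $\varphi$,
$$X(t) \xrightarrow{\ \varphi\ } e^{2i\pi\nu t}.$$
Therefore $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$.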


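5. We now prove uniqueness. Suppose that there exists another spectral decomposition $d\tilde x(\nu)$. Denote by $\mathcal{G}$ the set of finite linear combinations of complex exponentials. Since by hypothesis
$$\int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,d\tilde x(\nu) \quad (= X(t)),$$
we have
$$\int_{\mathbb{R}} f(\nu)\,dx(\nu) = \int_{\mathbb{R}} f(\nu)\,d\tilde x(\nu)$$
for all $f \in \mathcal{G}$, and therefore, for all $f \in L^2_{\mathbb{C}}(\mu) \cap L^2_{\mathbb{C}}(\tilde\mu) \subseteq L^2_{\mathbb{C}}(\frac{1}{2}(\mu + \tilde\mu))$, because $\mathcal{G}$ is dense in $L^2_{\mathbb{C}}(\frac{1}{2}(\mu + \tilde\mu))$. In particular, with $f = 1_{(a,b]}$,
$$x(b) - x(a) = \tilde x(b) - \tilde x(a).$$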


More details can be obtained as to the continuity properties (in the quadratic mean) of the increments of the spectral decomposition. For instance, it is right-continuous in the quadratic mean, and it admits a left-hand limit in the quadratic mean at any point $\nu \in \mathbb{R}$. If such limit is denoted by $x(\nu-)$, then, for all $a \in \mathbb{R}$,
$$E\left[|x(a) - x(a-)|^2\right] = \mu(\{a\}).$$

Proof. The right-continuity follows from the continuity of the (finite) measure $\mu$:
$$\lim_{h\downarrow 0} E\left[|x(a+h) - x(a)|^2\right] = \lim_{h\downarrow 0} \mu((a,a+h]) = \mu(\emptyset) = 0.$$

As for the existence of left-hand limits, it is guaranteed by the Cauchy criterion, since for all $a \in \mathbb{R}$,
$$\lim_{h,h'\downarrow 0,\ h<h'} E\left[|x(a-h) - x(a-h')|^2\right] = \lim_{h,h'\downarrow 0,\ h<h'} \mu((a-h',a-h]) = 0.$$
Finally,
$$E\left[|x(a) - x(a-)|^2\right] = \lim_{h\downarrow 0} E\left[|x(a) - x(a-h)|^2\right] = \lim_{h\downarrow 0} \mu((a-h,a]) = \mu(\{a\}).$$


Theorem 12.3.2 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a WSS stochastic process continuous in the quadratic mean. It is real if and only if its spectral decomposition is hermitian symmetric, that is, for all $[a,b] \subset \mathbb{R}$,
$$x(b) - x(a) = \left(x(-a-) - x(-b-)\right)^*.$$

Proof. If the stochastic process is real,
$$X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) = \left(\int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)\right)^*$$
$$= \int_{\mathbb{R}} e^{-2i\pi\nu t}\,dx^*(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx^*(-\nu),$$
and therefore, by uniqueness of the spectral decomposition, $dx(\nu) = dx^*(-\nu)$. Similarly, if $dx(\nu) = dx^*(-\nu)$,
$$X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx^*(-\nu) = \left(\int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)\right)^* = X(t)^*,$$
and therefore the process is real.


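Theorem 12.3.3 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process continuous in the quadratic mean. Then
$$H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R}) = H_{\mathbb{C}}(X(t); t \in \mathbb{R})$$
and both Hilbert subspaces are identical with
$$\left\{Z = \int_{\mathbb{R}} g(\nu)\,dx(\nu)\ ;\ g \in L^2_{\mathbb{C}}(\mu)\right\}.$$

Proof. 1. For all $\nu \in \mathbb{R}$, $x(\nu) \in H_{\mathbb{C}}(X(t); t \in \mathbb{R})$ (by definition of $x(\nu)$; see the proof of Theorem 12.3.1). Therefore,
$$H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R}) \subseteq H_{\mathbb{C}}(X(t); t \in \mathbb{R}).$$
On the other hand, for all $t \in \mathbb{R}$, $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) \in H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R})$. Therefore
$$H_{\mathbb{C}}(X(t); t \in \mathbb{R}) \subseteq H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R}).$$

2. Defining $H := \{Z = \int_{\mathbb{R}} g(\nu)\,dx(\nu)\ ;\ g \in L^2_{\mathbb{C}}(\mu)\}$, then $H \subseteq H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R})$. Moreover, since $H$ contains all the $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$, $H_{\mathbb{C}}(X(t); t \in \mathbb{R}) \subseteq H$. Therefore
$$H_{\mathbb{C}}(X(t); t \in \mathbb{R}) \subseteq H \subseteq H_{\mathbb{C}}(x(\nu); \nu \in \mathbb{R})$$
and the conclusion follows from Part 1 of the proof.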


A Plancherel—Parseval Formula 


The following result is the analog of the Plancherel—Parseval formula of classical 
Fourier analysis. 
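Theorem 12.3.4 Let $f : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R})$ with Fourier transform $\hat f$. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process with power spectral measure $\mu$ and Cramér-Khinchin spectral decomposition $dx(\nu)$. Then:
$$\int_{\mathbb{R}} \hat f(\nu)^*\,dx(\nu) = \int_{\mathbb{R}} f(t)^* X(t)\,dt. \tag{12.17}$$

Proof. Since $\hat f$ is bounded and continuous (as the Fourier transform of an integrable function), and since $\mu$ is a finite measure, we have that $\hat f \in L^2_{\mathbb{C}}(\mu)$, and
$$\sum_{k=-n2^n}^{n2^n-1} \hat f\left(\frac{k}{2^n}\right) 1_{(k/2^n,(k+1)/2^n]}(\nu) \to \hat f(\nu) \quad \text{in } L^2_{\mathbb{C}}(\mu),$$
and therefore (all limits in the following sequence of equalities are in $L^2_{\mathbb{C}}(P)$):
$$\int_{\mathbb{R}} \hat f(\nu)^*\,dx(\nu) = \lim_{n\uparrow\infty} \sum_{k=-n2^n}^{n2^n-1} \hat f\left(\frac{k}{2^n}\right)^* \left(x\left(\frac{k+1}{2^n}\right) - x\left(\frac{k}{2^n}\right)\right)$$
$$= \lim_{n\uparrow\infty} \sum_{k=-n2^n}^{n2^n-1} \left(\int_{\mathbb{R}} f(t)^* e^{2i\pi(k/2^n)t}\,dt\right) \left(x\left(\frac{k+1}{2^n}\right) - x\left(\frac{k}{2^n}\right)\right)$$
$$= \lim_{n\uparrow\infty} \int_{\mathbb{R}} f(t)^* \left[\sum_{k=-n2^n}^{n2^n-1} e^{2i\pi(k/2^n)t} \left(x\left(\frac{k+1}{2^n}\right) - x\left(\frac{k}{2^n}\right)\right)\right] dt$$
$$= \lim_{n\uparrow\infty} \int_{\mathbb{R}} f(t)^* X_n(t)\,dt,$$
where
$$X_n(t) = \sum_{k=-n2^n}^{n2^n-1} e^{2i\pi(k/2^n)t} \left(x\left(\frac{k+1}{2^n}\right) - x\left(\frac{k}{2^n}\right)\right) \to X(t) \quad \text{in } L^2_{\mathbb{C}}(P).$$

The announced result will then follow once we prove that
$$\lim_{n\uparrow\infty} \int_{\mathbb{R}} f(t)^* X_n(t)\,dt = \int_{\mathbb{R}} f(t)^* X(t)\,dt,$$
where the limit is in $L^2_{\mathbb{C}}(P)$. In fact, with $Y_n(t) = X(t) - X_n(t)$,
$$E\left[\left|\int_{\mathbb{R}} f(t)^* Y_n(t)\,dt\right|^2\right] = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)^* f(s)\,E\left[Y_n(t)Y_n(s)^*\right] dt\,ds.$$
But $\lim_{n\uparrow\infty} Y_n(t) = 0$ (in $L^2_{\mathbb{C}}(P)$) and therefore $\lim_{n\uparrow\infty} E\left[Y_n(t)Y_n(s)^*\right] = 0$. Moreover $E\left[Y_n(t)Y_n(s)^*\right]$ is uniformly bounded in $n$. Therefore, by dominated convergence,
$$\lim_{n\uparrow\infty} \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)^* f(s)\,E\left[Y_n(t)Y_n(s)^*\right] dt\,ds = 0.$$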


ntoo 


EXAMPLE 12.3.5: CONVOLUTIONAL FILTERING. Let $h \in L^1_{\mathbb{C}}(\mathbb{R})$ and let $\hat h$ be its Fourier transform. Then
$$\int_{\mathbb{R}} h(t-s)X(s)\,ds = \int_{\mathbb{R}} \hat h(\nu)e^{2i\pi\nu t}\,dx(\nu). \tag{12.18}$$

Proof. It suffices to apply (12.17) to the function $s \mapsto h^*(t-s)$, whose Fourier transform is $\hat h(\nu)^* e^{-2i\pi\nu t}$.


Linear Operations on WSS Stochastic Processes


A function $g : \mathbb{R} \to \mathbb{C}$ in $L^2_{\mathbb{C}}(\mu_X)$ defines a linear operation on the centered WSS stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ (called the input) by associating with it the centered stochastic process (called the output)
$$Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t} g(\nu)\,dx(\nu). \tag{12.19}$$

On the other hand, the calculation of the covariance function
$$C_Y(\tau) = E\left[Y(t+\tau)Y(t)^*\right]$$
of the output gives, by isometry,
$$C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|g(\nu)|^2\,\mu_X(d\nu),$$
where $\mu_X$ is the power spectral measure of the input. The power spectral measure of the output process is then
$$\mu_Y(d\nu) = |g(\nu)|^2\,\mu_X(d\nu). \tag{12.20}$$

This is similar to the formula obtained when $\{Y(t)\}_{t\in\mathbb{R}}$ is the output of a stable convolutional filter with impulse response $h$ and transmittance $\hat h$: $\mu_Y(d\nu) = |\hat h(\nu)|^2\,\mu_X(d\nu)$. We therefore say that $g$ is the transmittance of the “filter” (12.19). Note however that this filter is not necessarily of the convolutional type, since $g$ may well not be the Fourier transform of an integrable function (for instance it may be unbounded, as the next example shows).


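EXAMPLE 12.3.6: DIFFERENTIATION. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a WSS stochastic process with spectral measure $\mu_X$ such that
$$\int_{\mathbb{R}} |\nu|^2\,\mu_X(d\nu) < \infty. \tag{12.21}$$
Then
$$\lim_{h\to 0} \frac{X(t+h) - X(t)}{h} = \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu),$$
where the limit is in the quadratic mean. The linear operation corresponding to the transmittance $g(\nu) = 2i\pi\nu$ is therefore the differentiation in quadratic mean.


Proof. Let $h \in \mathbb{R}$, $h \neq 0$. From the equality
$$\frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\left(\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right) dx(\nu)$$
we have, by isometry,
$$\lim_{h\to 0} E\left[\left|\frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu)\right|^2\right] = \lim_{h\to 0} \int_{\mathbb{R}} \left|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right|^2 \mu_X(d\nu).$$
The latter limit is 0, by dominated convergence, since $|e^{i\theta} - 1| \le |\theta|$ for all real $\theta$ gives
$$\left|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right|^2 \le 16\pi^2\nu^2,$$
and in view of the hypothesis (12.21).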
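The dominated-convergence argument can be watched at work on a discrete spectral measure, for which the integral in the isometry is a finite sum. This is a sketch under assumptions: the three spectral lines are hypothetical, and the dominating function used in the check is $16\pi^2\nu^2$, which follows from the elementary bound $|e^{i\theta}-1| \le |\theta|$.

```python
import cmath
import math

# Hypothetical spectral lines (nu_j, P_j); the second moment
# sum_j nu_j^2 P_j is finite, so the integrability hypothesis holds.
SPECTRUM = [(-2.0, 0.5), (0.5, 1.0), (3.0, 0.25)]

def integrand(nu: float, h: float) -> float:
    """|(exp(2 i pi nu h) - 1)/h - 2 i pi nu|^2."""
    return abs((cmath.exp(2j * math.pi * nu * h) - 1.0) / h
               - 2j * math.pi * nu) ** 2

def mean_square_error(h: float) -> float:
    """Mean square distance between the difference quotient and the
    candidate derivative, computed through the isometry formula."""
    return sum(p * integrand(nu, h) for nu, p in SPECTRUM)

errors = [mean_square_error(10.0 ** (-k)) for k in range(1, 6)]
for k, err in enumerate(errors, start=1):
    print(f"h = 1e-{k}  mean square error = {err:.3e}")
```

The error decays roughly like $h^2$, in line with the second-order Taylor remainder of the complex exponential.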


“A line spectrum corresponds to a combination of sinusoids.” More precisely: 
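Theorem 12.3.7 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process with spectral measure
$$\mu_X(d\nu) = \sum_{k\in\mathbb{Z}} P_k\,\varepsilon_{\nu_k}(d\nu),$$
where $\varepsilon_{\nu_k}$ is the Dirac measure at $\nu_k \in \mathbb{R}$, $P_k \in \mathbb{R}_+$ and $\sum_{k\in\mathbb{Z}} P_k < \infty$. Then
$$X(t) = \sum_{k\in\mathbb{Z}} U_k e^{2i\pi\nu_k t},$$
where $\{U_k\}_{k\in\mathbb{Z}}$ is a sequence of centered uncorrelated square-integrable complex variables, and $E[|U_k|^2] = P_k$.


Proof. Let
$$g(\nu) = \sum_{k\in\mathbb{Z}} 1_{\{\nu_k\}}(\nu).$$
It is in $L^2_{\mathbb{C}}(\mu_X)$, as is $1 - g(\nu)$. Also $\int_{\mathbb{R}} |1 - g(\nu)|^2\,\mu_X(d\nu) = 0$, and in particular $\int_{\mathbb{R}} (1 - g(\nu))e^{2i\pi\nu t}\,dx(\nu) = 0$. Therefore
$$X(t) = \int_{\mathbb{R}} g(\nu)e^{2i\pi\nu t}\,dx(\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi\nu_k t}\left(x(\nu_k) - x(\nu_k-)\right).$$
We conclude by defining $U_k = x(\nu_k) - x(\nu_k-)$.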
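The converse direction (a sum of sinusoids with centered uncorrelated amplitudes has a line spectrum) lends itself to a quick Monte Carlo check. The sketch below rests on assumptions that are not in the text: two hypothetical lines, amplitudes of the form $\sqrt{P_k}\,e^{i\Theta_k}$ with independent uniform phases, a fixed seed and an arbitrary sample size.

```python
import cmath
import math
import random

LINES = [(0.5, 1.0), (2.0, 0.5)]   # hypothetical (nu_k, P_k)

def sample_amplitudes(rng):
    """U_k = sqrt(P_k) exp(i Theta_k) with independent uniform phases:
    centered, uncorrelated, and E|U_k|^2 = P_k."""
    return [math.sqrt(p) * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
            for _, p in LINES]

def trajectory_value(amplitudes, t):
    """X(t) = sum_k U_k exp(2 i pi nu_k t)."""
    return sum(u * cmath.exp(2j * math.pi * nu * t)
               for u, (nu, _) in zip(amplitudes, LINES))

def empirical_covariance(t, tau, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[X(t + tau) X(t)^*]."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_samples):
        amps = sample_amplitudes(rng)
        total += (trajectory_value(amps, t + tau)
                  * trajectory_value(amps, t).conjugate())
    return total / n_samples

def theoretical_covariance(tau):
    """C(tau) = sum_k P_k exp(2 i pi nu_k tau)."""
    return sum(p * cmath.exp(2j * math.pi * nu * tau) for nu, p in LINES)

print(empirical_covariance(0.3, 0.2), theoretical_covariance(0.2))
```

The estimate should not depend on the chosen time `t` beyond Monte Carlo noise, which is the wide-sense stationarity of the sum.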


Linear Transformations of Gaussian Processes 


We call a linear transformation of the WSS stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ a transformation of it into the second-order process (not WSS in general)
$$Y(t) = \int_{\mathbb{R}} g(t,\nu)\,dx(\nu), \tag{12.22}$$
where
$$\int_{\mathbb{R}} |g(t,\nu)|^2\,\mu_X(d\nu) < \infty \quad \text{for all } t \in \mathbb{R}.$$


Theorem 12.3.8 Every linear transformation of a Gaussian WSS stochastic process yields a Gaussian stochastic process.


Proof. Let $\{X(t)\}_{t\in\mathbb{R}}$ be centered, Gaussian, WSS, with Cramér-Khinchin decomposition $dx(\nu)$. For each $\nu \in \mathbb{R}$, the random variable $x(\nu)$ is in $H_{\mathbb{R}}(X)$, by construction. Now, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian process, $H_{\mathbb{R}}(X)$ is a Gaussian subspace. But (Theorem 12.3.3) $H_{\mathbb{R}}(X) = H_{\mathbb{R}}(x)$. Therefore the process (12.22) is in $H_{\mathbb{R}}(X)$, hence Gaussian.


EXAMPLE 12.3.9: CONVOLUTIONAL FILTERING OF A WSS GAUSSIAN PROCESS. In particular, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian WSS process with Cramér-Khinchin decomposition $dx(\nu)$, and if $g \in L^2_{\mathbb{C}}(\mu_X)$, the process
$$Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t} g(\nu)\,dx(\nu)$$
is a Gaussian process.


A particular case is when $g = \hat h$, the Fourier transform of a filter with integrable impulse response $h$; the signal $\{Y(t)\}_{t\in\mathbb{R}}$ is the one obtained by convolutional filtering of $\{X(t)\}_{t\in\mathbb{R}}$ with this filter.


12.4 Multivariate WSS Stochastic Processes


Let $\{X(t)\}_{t\in\mathbb{R}}$ be a stochastic process with values in $E := \mathbb{C}^L$, where $L$ is an integer greater than or equal to 2: $X(t) = (X_1(t), \ldots, X_L(t))$. This process is assumed to be of the second order, that is:
$$E\left[\|X(t)\|^2\right] < \infty \quad \text{for all } t \in \mathbb{R},$$
and centered. Furthermore, it will be assumed that it is wide-sense stationary, in the sense that the mean vector of $X(t)$ and the cross-covariance matrix of the vectors $X(t+\tau)$ and $X(t)$ do not depend upon $t$. The matrix-valued function $C$ defined by
$$C(\tau) = \mathrm{cov}\,(X(t+\tau), X(t)) \tag{12.23}$$
is called the (matrix) covariance function of the stochastic process. Its general entry is
$$C_{ij}(\tau) = \mathrm{cov}\,(X_i(t+\tau), X_j(t)).$$

Therefore, each of the processes $\{X_i(t)\}_{t\in\mathbb{R}}$ is a WSS stochastic process, but, furthermore, they are stationarily correlated or “jointly WSS”. The vector-valued stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is then called a multivariate WSS stochastic process.
EXAMPLE 12.4.1: SIGNAL PLUS NOISE. The following model frequently appears in signal processing:
$$Y(t) = S(t) + B(t),$$
where $\{S(t)\}_{t\in\mathbb{R}}$ and $\{B(t)\}_{t\in\mathbb{R}}$ are two uncorrelated centered WSS stochastic processes with respective covariance functions $C_S$ and $C_B$. Then, $\{(Y(t), B(t))^T\}_{t\in\mathbb{R}}$ is a bivariate WSS stochastic process. In fact, by the assumption of non-correlation:
$$C(\tau) = \begin{pmatrix} C_S(\tau) + C_B(\tau) & C_B(\tau) \\ C_B(\tau) & C_B(\tau) \end{pmatrix}.$$

We shall need at this point a minor extension of the notion of measure. 


Definition 12.4.2 A finite complex measure on the measurable space $(X, \mathcal{X})$ is, by definition, a mapping $\mu : \mathcal{X} \to \mathbb{C}$ of the form
$$\mu = \mu_R + i\mu_I,$$
where $\mu_R$ and $\mu_I$ are finite measures on $(X, \mathcal{X})$. The integral of a measurable function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with respect to such measure is defined by
$$\int_X f(x)\,\mu(dx) = \int_X f(x)\,\mu_R(dx) + i\int_X f(x)\,\mu_I(dx)$$
whenever $f$ is integrable with respect to both $\mu_R$ and $\mu_I$.


Theorem 12.4.3 Let $\{X(t)\}_{t\in\mathbb{R}}$ be an $L$-dimensional multivariate WSS stochastic process. For all $r, s$ ($1 \le r, s \le L$) there exists a finite complex measure $\mu_{rs}$ such that
$$C_{rs}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\mu_{rs}(d\nu). \tag{12.24}$$

Proof. (The case $r = 1$, $s = 2$). Let us consider the stochastic processes
$$Y(t) = X_1(t) + X_2(t), \qquad Z(t) = iX_1(t) + X_2(t).$$
These are WSS stochastic processes with respective covariance functions
$$C_Y(\tau) = C_1(\tau) + C_2(\tau) + C_{12}(\tau) + C_{21}(\tau),$$
$$C_Z(\tau) = C_1(\tau) + C_2(\tau) + iC_{12}(\tau) - iC_{21}(\tau).$$
From these two equalities we deduce
$$C_{12}(\tau) = \frac{1}{2}\left\{[C_Y(\tau) - C_1(\tau) - C_2(\tau)] - i[C_Z(\tau) - C_1(\tau) - C_2(\tau)]\right\},$$
from which the result follows with
$$\mu_{12} = \frac{1}{2}\left\{[\mu_Y - \mu_1 - \mu_2] - i[\mu_Z - \mu_1 - \mu_2]\right\}.$$


The matrix
$$M := \{\mu_{ij}\}_{1\le i,j\le L}$$
(whose entries are finite complex measures) is the interspectral power measure matrix of the multivariate WSS stochastic process $\{X(t)\}_{t\in\mathbb{R}}$. It is clear that for all $z = (z_1, \ldots, z_L)^T \in \mathbb{C}^L$, $U(t) = z^\dagger X(t)$ defines a WSS stochastic process with spectral measure $\mu_U = z^\dagger M z$ (recall that $\dagger$ means transpose conjugate).


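The link between the interspectral measure $\mu_{12}$ and the Cramér-Khinchin decompositions $dx_1(\nu)$ and $dx_2(\nu)$ is the following:
$$E\left[(x_1(\nu_2) - x_1(\nu_1))(x_2(\nu_4) - x_2(\nu_3))^*\right] = \mu_{12}\left((\nu_1,\nu_2] \cap (\nu_3,\nu_4]\right).$$

This is a particular case of the following: for all functions $g_i : \mathbb{R} \to \mathbb{C}$, $g_i \in L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$),
$$E\left[\left(\int_{\mathbb{R}} g_1(\nu)\,dx_1(\nu)\right)\left(\int_{\mathbb{R}} g_2(\nu)\,dx_2(\nu)\right)^*\right] = \int_{\mathbb{R}} g_1(\nu)g_2(\nu)^*\,\mu_{12}(d\nu). \tag{12.25}$$

Indeed, equality (12.25) is true for $g_1(\nu) = e^{2i\pi t_1\nu}$, $g_2(\nu) = e^{2i\pi t_2\nu}$, since it then reduces to
$$E\left[X_1(t_1)X_2(t_2)^*\right] = \int_{\mathbb{R}} e^{2i\pi(t_1 - t_2)\nu}\,\mu_{12}(d\nu).$$

This is therefore verified for $g_1, g_2 \in \mathcal{E}$, the set of finite linear combinations of functions of the type $\nu \mapsto e^{2i\pi t\nu}$ ($t \in \mathbb{R}$). But $\mathcal{E}$ is dense in $L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$),⁴ and therefore the equality (12.25) is true for all $g_i \in L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$).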


Theorem 12.4.4 The interspectral measure $\mu_{12}$ is absolutely continuous with respect to each of the spectral measures $\mu_1$ and $\mu_2$.


Proof. This means that $\mu_{12}(A) = 0$ whenever $\mu_1(A) = 0$ or $\mu_2(A) = 0$. Indeed,
$$\mu_{12}(A) = E\left[\left(\int_A dx_1\right)\left(\int_A dx_2\right)^*\right]$$


4 This will be admitted. 
and $\mu_1(A) = 0$ implies $\int_A dx_1 = 0$ since
$$E\left[\left|\int_A dx_1\right|^2\right] = \mu_1(A).$$

Therefore, each of the spectral measures $\mu_{ij}$ is absolutely continuous with respect to the trace
$$\mathrm{Tr}\,M := \sum_{j=1}^{L} \mu_{jj}$$
of the power spectral measure matrix. By the Radon-Nikodym theorem there exists a function $g_{ij} : \mathbb{R} \to \mathbb{C}$ such that
$$\mu_{ij}(A) = \int_A g_{ij}(\nu)\,\mathrm{Tr}\,M(d\nu).$$
We say that the matrix
$$g(\nu) = \{g_{ij}(\nu)\}_{1\le i,j\le L}$$
is the canonical spectral density matrix of $\{X(t)\}_{t\in\mathbb{R}}$. One should insist that it is not required that the stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$, $1 \le i \le L$, admit power spectral densities.


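The correlation matrix $C(\tau)$ has, with the above notations, the representation
$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,g(\nu)\,\mathrm{Tr}\,M(d\nu).$$

If each of the WSS stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$ admits a spectral density, $\{X(t)\}_{t\in\mathbb{R}}$ admits an interspectral density matrix
$$f(\nu) = \{f_{ij}(\nu)\}_{1\le i,j\le L},$$
that is:
$$C_{ij}(\tau) = \mathrm{cov}\,(X_i(t+\tau), X_j(t)) = \int_{\mathbb{R}} e^{2i\pi\nu\tau} f_{ij}(\nu)\,d\nu.$$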


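EXAMPLE 12.4.5: INTERFERENCES. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process with power spectral measure $\mu_X$. Let $h_1, h_2 : \mathbb{R} \to \mathbb{C}$ be integrable functions with respective Fourier transforms $\hat h_1$ and $\hat h_2$. Define for $i = 1, 2$,
$$Y_i(t) = \int_{\mathbb{R}} h_i(t-s)X(s)\,ds.$$
The WSS stochastic processes $\{Y_1(t)\}_{t\in\mathbb{R}}$ and $\{Y_2(t)\}_{t\in\mathbb{R}}$ are stationarily correlated. In fact (assuming that they are centered, without loss of generality),
$$E\left[Y_1(t+\tau)Y_2(t)^*\right] = E\left[\left(\int_{\mathbb{R}} h_1(t+\tau-s)X(s)\,ds\right)\left(\int_{\mathbb{R}} h_2(t-s)X(s)\,ds\right)^*\right]$$
$$= \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(t+\tau-u)\,h_2^*(t-v)\,C_X(u-v)\,du\,dv$$
$$= \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(\tau-u)\,h_2^*(-v)\,C_X(u-v)\,du\,dv,$$
and this quantity depends only upon $\tau$. Replacing $C_X(u-v)$ by its expression in terms of the spectral measure $\mu_X$, one obtains
$$C_{Y_1Y_2}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\hat h_1(\nu)\hat h_2(\nu)^*\,\mu_X(d\nu).$$

The power spectral matrix of the bivariate process $\{(Y_1(t), Y_2(t))\}_{t\in\mathbb{R}}$ is therefore
$$\mu_Y(d\nu) = \begin{pmatrix} |\hat h_1(\nu)|^2 & \hat h_1(\nu)\hat h_2(\nu)^* \\ \hat h_1(\nu)^*\hat h_2(\nu) & |\hat h_2(\nu)|^2 \end{pmatrix} \mu_X(d\nu).$$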


Band-pass Stochastic Processes 


Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered WSS stochastic process with power spectral measure $\mu_X$ and Cramér-Khinchin decomposition $dx(\nu)$. This process is assumed real, and therefore
$$\mu_X(-d\nu) = \mu_X(d\nu), \qquad dx(-\nu) = dx(\nu)^*.$$

Definition 12.4.6 The above WSS stochastic process is called band-pass $(\nu_0, B)$, where $\nu_0 > B > 0$, if the support of $\mu_X$ is contained in the frequency band $[-\nu_0 - B, -\nu_0 + B] \cup [\nu_0 - B, \nu_0 + B]$. It is called base-band $(B)$ if, in addition, $\nu_0 = 0$.


Our purpose is to show that such a band-pass stochastic process admits the following quadrature decomposition
$$X(t) = M(t)\cos 2\pi\nu_0 t - N(t)\sin 2\pi\nu_0 t, \tag{12.26}$$
where $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$, called the quadrature components, are real base-band $(B)$ WSS stochastic processes. To prove this, let $G(\nu) := -i\,\mathrm{sign}(\nu)$ ($= 0$ if $\nu = 0$). The function $G$ is the so-called Hilbert filter transmittance. The quadrature process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is defined by
$$Y(t) = \int_{\mathbb{R}} G(\nu)e^{2i\pi\nu t}\,dx(\nu).$$

The right-hand side of the preceding equality is well defined since $\int_{\mathbb{R}} |G(\nu)|^2\,\mu_X(d\nu) = \mu_X(\mathbb{R}) < \infty$. Moreover, this stochastic process is real, since its spectral decomposition is hermitian symmetric. The analytic process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is, by definition, the stochastic process
$$Z(t) = X(t) + iY(t) = \int_{\mathbb{R}} (1 + iG(\nu))e^{2i\pi\nu t}\,dx(\nu) = 2\int_{(0,\infty)} e^{2i\pi\nu t}\,dx(\nu).$$


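Taking into account that $|G(\nu)|^2 = 1$, the preceding expressions and the Wiener isometry formulas lead to the following properties:
$$\mu_Y(d\nu) = \mu_X(d\nu), \qquad C_Y(\tau) = C_X(\tau), \qquad C_{XY}(\tau) = -C_{YX}(\tau),$$
$$\mu_Z(d\nu) = 4\cdot 1_{\mathbb{R}_+}(\nu)\,\mu_X(d\nu), \qquad C_Z(\tau) = 2\{C_X(\tau) + iC_{YX}(\tau)\},$$
and
$$E\left[Z(t+\tau)Z(t)\right] = 0. \tag{$\star$}$$
Defining the complex envelope of $\{X(t)\}_{t\in\mathbb{R}}$ by
$$U(t) = Z(t)e^{-2i\pi\nu_0 t}, \tag{$\star\star$}$$
it follows from this definition that
$$C_U(\tau) = e^{-2i\pi\nu_0\tau}C_Z(\tau), \qquad \mu_U(d\nu) = \mu_Z(d\nu + \nu_0), \tag{$\dagger$}$$
whereas $(\star)$ and $(\star\star)$ give
$$E\left[U(t+\tau)U(t)\right] = 0. \tag{$\dagger\dagger$}$$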


The quadrature components $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$ of $\{X(t)\}_{t\in\mathbb{R}}$ are the real WSS stochastic processes defined by
$$U(t) = M(t) + iN(t).$$
Since
$$X(t) = \mathrm{Re}\{Z(t)\} = \mathrm{Re}\{U(t)e^{2i\pi\nu_0 t}\},$$
we have the decomposition (12.26). Taking $(\dagger\dagger)$ into account we obtain:
$$C_M(\tau) = C_N(\tau) = \frac{1}{4}\left\{C_U(\tau) + C_U(\tau)^*\right\}$$
and
$$C_{MN}(\tau) = -C_{NM}(\tau) = \frac{i}{4}\left\{C_U(\tau) - C_U(\tau)^*\right\}, \tag{$\diamond$}$$
and the corresponding relations for the spectra
$$\mu_M(d\nu) = \mu_N(d\nu) = \left\{\mu_X(d\nu - \nu_0) + \mu_X(d\nu + \nu_0)\right\} 1_{[-B,+B]}(\nu).$$

From $(\diamond)$ and the observation that $C_U(0) = C_U(0)^*$ (since $C_U(0) = E[|U(0)|^2]$ is real), we deduce $C_{MN}(0) = 0$, that is to say,
$$E\left[M(t)N(t)\right] = 0. \tag{12.27}$$


If, furthermore, the original process has a power spectral measure that is symmetric about $\nu_0$ in the band $[\nu_0 - B, \nu_0 + B]$, the same holds for the spectrum of the analytic process and, by $(\dagger)$, the complex envelope has a spectral measure symmetric about 0, which implies $C_U(\tau) = C_U(\tau)^*$ and then, by $(\diamond)$,
$$E\left[M(t)N(t+\tau)\right] = 0. \tag{12.28}$$


In summary: 


Theorem 12.4.7 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered real band-pass $(\nu_0, B)$ WSS stochastic process. The values of its quadrature components at a given time are uncorrelated. Moreover, if the original stochastic process has a power spectral measure symmetric about $\nu_0$, the quadrature component processes are uncorrelated.


More can be said when the original process is Gaussian. In this case, the quadrature component processes are jointly Gaussian (being obtained from the original Gaussian process by linear operations). In particular, for all $t \in \mathbb{R}$, $M(t)$ and $N(t)$ are jointly Gaussian and uncorrelated, and therefore independent.


If moreover the original process has a spectrum symmetric about $\nu_0$, then, by (12.28), $M(t_1)$ and $N(t_2)$ ($t_1, t_2 \in \mathbb{R}$) are uncorrelated jointly Gaussian variables, and therefore independent. In other words, the quadrature component processes are two independent centered Gaussian WSS stochastic processes.
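The construction of the analytic process, the complex envelope and the quadrature components can be followed on a single sample path. The sketch below is a deterministic illustration under assumptions: the path is one sinusoid at frequency $\nu_0 + \delta$ with fixed amplitude and phase (all numerical values are hypothetical), for which the Hilbert filter turns the cosine into a sine.

```python
import cmath
import math

NU0 = 10.0            # hypothetical carrier frequency nu_0
DELTA = 0.3           # hypothetical offset, with |DELTA| <= B < NU0
A, THETA = 1.7, 0.4   # amplitude and phase of this one sample path

def x(t: float) -> float:
    """Real band-pass sample path: one sinusoid at frequency NU0 + DELTA."""
    return A * math.cos(2.0 * math.pi * (NU0 + DELTA) * t + THETA)

def analytic(t: float) -> complex:
    """Z(t) = X(t) + i Y(t): only the positive frequency is kept,
    with twice the weight it has in X."""
    return A * cmath.exp(1j * (2.0 * math.pi * (NU0 + DELTA) * t + THETA))

def envelope(t: float) -> complex:
    """Complex envelope U(t) = Z(t) exp(-2 i pi nu_0 t)."""
    return analytic(t) * cmath.exp(-2j * math.pi * NU0 * t)

def quadrature_reconstruction(t: float) -> float:
    """M(t) cos(2 pi nu_0 t) - N(t) sin(2 pi nu_0 t), with U = M + i N."""
    u = envelope(t)
    m, n = u.real, u.imag
    return (m * math.cos(2.0 * math.pi * NU0 * t)
            - n * math.sin(2.0 * math.pi * NU0 * t))

for t in (0.0, 0.13, 0.5, 1.0):
    print(t, x(t), quadrature_reconstruction(t), envelope(t))
```

The envelope oscillates at the offset frequency only, which is what makes the quadrature components base-band.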
12.5 Exercises


Exercise 12.5.1. APPROXIMATE DERIVATIVE OF THE BROWNIAN MOTION 
Prove Formula 12.14. 


Exercise 12.5.2. STATIONARIZATION OF A CYCLIC STOCHASTIC PROCESS
Let $\{Y(t)\}_{t\ge 0}$ be the stochastic process taking its values in $\{-1,+1\}$ defined by
$$Y(t) = Z \times (-1)^n \ \text{ on } (nT, (n+1)T] \quad (n \ge 0),$$
where $T$ is a positive real number and $Z$ is a random variable equidistributed on $\{-1,+1\}$.


(1) Show that $\{Y(t)\}_{t\ge 0}$ is not a stationary (neither strictly nor in the wide sense) stochastic process.


(2) Let now $U$ be a random variable uniformly distributed on $[0,T]$ and independent of $Z$. Define for all $t \ge 0$,


X(t)=Y(t—U)*. 


Show that $\{X(t)\}_{t\ge 0}$ is a strictly stationary stochastic process and compute its covariance function.


Exercise 12.5.3. AN ERGODIC PROPERTY
Let $\{X(t)\}_{t\ge 0}$ be a wide-sense stationary stochastic process with mean $m_X$ and covariance function $C(\tau)$. Prove that in order that
$$\lim_{T\uparrow\infty} \frac{1}{T}\int_0^T X(s)\,ds = m_X$$
holds in the quadratic mean, it is necessary and sufficient that
$$\lim_{T\uparrow\infty} \frac{1}{T}\int_0^T \left(1 - \frac{u}{T}\right) C(u)\,du = 0. \tag{12.29}$$

Show that this condition is satisfied in particular when the covariance function is integrable.


Exercise 12.5.4. SYMMETRIC POWER SPECTRAL MEASURE
Show that the power spectral measure of a real WSS stochastic process is symmetric.


Exercise 12.5.5. PRODUCTS OF INDEPENDENT WSS STOCHASTIC PROCESSES
Let $\{X(t)\}_{t\in\mathbb{R}}$ and $\{Y(t)\}_{t\in\mathbb{R}}$ be two independent centered WSS stochastic processes of respective covariance functions $C_X(\tau)$ and $C_Y(\tau)$.

1. Show that $Z(t) := X(t)Y(t)$ ($t \in \mathbb{R}$) is a WSS stochastic process. Give its mean and covariance function.


2. Assume in addition that $\{X(t)\}_{t\in\mathbb{R}}$ is the harmonic process of Example 11.1.14. Suppose that $\{Y(t)\}_{t\in\mathbb{R}}$ admits a power spectral density $f_Y(\nu)$. Give the power spectral density $f_Z(\nu)$ of $\{Z(t)\}_{t\in\mathbb{R}}$.


Exercise 12.5.6. THE APPROXIMATE DERIVATIVE OF A WIENER PROCESS
Let $\{W(t)\}_{t\in\mathbb{R}}$ be a Wiener process. Show that for $a > 0$, the stochastic process
$$X(t) = \frac{W(t+a) - W(t)}{a} \quad (t \in \mathbb{R})$$
is a WSS stochastic process. Compute its mean, its covariance function and its power spectral density.


Exercise 12.5.7. THE SQUARE OF A BAND-LIMITED WHITE NOISE
Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wide-sense stationary centered Gaussian process with covariance function $C_X(\tau)$ and with the power spectral density
$$f_X(\nu) = \frac{N_0}{2}\,1_{[-B,+B]}(\nu),$$
where $N_0 > 0$ and $B > 0$.

1. Let $Y(t) = X(t)^2$. Show that $\{Y(t)\}_{t\in\mathbb{R}}$ is a wide-sense stationary process.


2. Give its power spectral density $f_Y(\nu)$.


Exercise 12.5.8. PROJECTION OF WHITE NOISE ONTO AN ORTHONORMAL BASE
Let the set of square-integrable functions $\varphi_i : [0,T] \to \mathbb{R}$ ($1 \le i \le N$) be such that
$$\int_0^T \varphi_i(t)\varphi_j(t)\,dt = \delta_{ij} \quad (1 \le i, j \le N),$$
and let $\{B(t)\}_{t\in\mathbb{R}}$ be a Gaussian white noise with PSD 1. Show that the vector $B = (B_1, \ldots, B_N)^T$ defined by
$$B_i = \int_0^T B(t)\varphi_i(t)\,dt \quad (1 \le i \le N)$$
is a centered Gaussian vector with covariance matrix $\Gamma_B = I$, the identity matrix of size $N$. (In particular, the components $B_1, \ldots, B_N$ are identically distributed, independent, and centered Gaussian random variables with common variance 1.)
Exercise 12.5.9. AN IID SEQUENCE CARRIED BY AN HPP
Let $N$ be a homogeneous Poisson process on $\mathbb{R}_+$ of intensity $\lambda > 0$, and let $\{Z_n\}_{n\ge 0}$ be an IID sequence of integrable real random variables, centered, with finite variance $\sigma^2$, and independent of $N$.


1) Show that $\{Z_{N((0,t])}\}_{t\ge 0}$ is a wide-sense stationary stochastic process and give its covariance function.


2) Give its power spectral density.


3) Compute $P(X(t_1) = X(t_2))$ and $P(X(t_1) > X(t_2))$.


Exercise 12.5.10. POISSON SHOT NOISES
Let $N_1$, $N_2$ and $N_3$ be three independent homogeneous Poisson processes on $\mathbb{R}$ with respective intensities $\theta_1 > 0$, $\theta_2 > 0$ and $\theta_3 > 0$. Let $\{X_1(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_1 + N_3$ with an impulse function $h : \mathbb{R} \to \mathbb{R}$ that is bounded and with compact support (null outside a finite interval). Let $\{X_2(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_2 + N_3$ with the same impulse function $h$.


Compute the power spectral density of the wide-sense stationary process $\{X(t)\}_{t\in\mathbb{R}}$, where $X(t) = X_1(t) + X_2(t)$.


Exercise 12.5.11. FLIP-FLOP
Let $N$ be an HPP on $\mathbb{R}_+$ with intensity $\lambda$. Define the (telegraph or flip-flop) process $\{X(t)\}_{t\ge 0}$ with state space $E = \{+1,-1\}$ by
$$X(t) = Z \times (-1)^{N(t)},$$
where $X(0) = Z$ is an $E$-valued random variable independent of the counting process $N$. (Thus the telegraph process switches between $-1$ and $+1$ at each event of $N$.) The probability distribution of $Z$ is arbitrary.


1. Compute $P(X(t+s) = j \mid X(s) = i)$ for all $t, s \ge 0$ and all $i, j \in E$.

2. Give, for all $i \in E$, the limit of $P(X(t) = i)$ as $t$ tends to $\infty$.


3. Show that when $P(Z = 1) = \frac{1}{2}$, the process is a stationary process and give its power spectral measure.


Exercise 12.5.12. FLIP-FLOP WITH LIMITED MEMORY
Let $N$ be an HPP on $\mathbb{R}$ with intensity $\lambda > 0$. Define for all $t \in \mathbb{R}$


X() = ere), 
1. Show that $\{X(t)\}_{t\in\mathbb{R}}$ is a WSS stochastic process.

2. Compute its power spectral density.


3. Give the best affine estimate of $X(t+\tau)$ in terms of $X(t)$, that is, find $\alpha, \beta$ minimizing
$$E\left[|X(t+\tau) - (\alpha + \beta X(t))|^2\right], \quad \text{when } \tau > 0.$$


Exercise 12.5.13. JUMPING PHASE 
Define for each t € R, t > 0, 
X(t) = Pn, 


where $\{N(t)\}_{t\ge 0}$ is the counting process of a homogeneous Poisson process on $\mathbb{R}_+$ with intensity $\lambda > 0$, and $\{\Phi_n\}_{n\ge 0}$ is an IID sequence of random variables uniformly distributed on $[0,2\pi]$, and independent of the Poisson process.


Show that $\{X(t)\}_{t\ge 0}$ is a wide-sense stationary process, give its covariance function $C_X(\tau)$ and its power spectral measure.


Appendix A 


A Review of Hilbert Spaces 


Basic Definitions 


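Let $H$ be a vector space with scalar field $K = \mathbb{C}$ or $\mathbb{R}$, endowed with a map $(x,y) \in H \times H \mapsto \langle x,y\rangle \in K$ such that for all $x, y, z \in H$ and all $\lambda \in K$,


1. $\langle y,x\rangle = \langle x,y\rangle^*$,

2. $\langle \lambda y,x\rangle = \lambda\langle y,x\rangle$,

3. $\langle x,y+z\rangle = \langle x,y\rangle + \langle x,z\rangle$,

4. $\langle x,x\rangle \ge 0$; and $\langle x,x\rangle = 0$ if and only if $x = 0$.


Then $H$ is called a pre-Hilbert space over $K$ and $\langle x,y\rangle$ is called the inner product of $x$ and $y$. For any $x \in H$, define
$$\|x\|^2 = \langle x,x\rangle.$$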

The parallelogram identity
$$\|x\|^2 + \|y\|^2 = \frac{1}{2}\left(\|x+y\|^2 + \|x-y\|^2\right)$$
is obtained by expanding the right-hand side and using the equality
$$\|x+y\|^2 = \|x\|^2 + \|y\|^2 + 2\mathrm{Re}\{\langle x,y\rangle\}.$$

The polarization identity
$$\langle x,y\rangle = \frac{1}{4}\left\{\|x+y\|^2 - \|x-y\|^2 + i\|x+iy\|^2 - i\|x-iy\|^2\right\}$$
is checked by expanding the right-hand side. It shows in particular that two inner products $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$ on $H$ such that $\|\cdot\|_1 = \|\cdot\|_2$ are identical.
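Both identities are easy to test numerically in $\mathbb{C}^n$. The following is a minimal sketch, assuming the standard inner product $\langle x,y\rangle = \sum_i x_i y_i^*$ (linear in the first argument, as in axioms 1 to 3); the helper names and the two test vectors are arbitrary.

```python
def inner(x, y):
    """Inner product on C^n, linear in the first argument: sum x_i conj(y_i)."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm_sq(x):
    """Squared norm <x, x> (real by axiom 1)."""
    return inner(x, x).real

def add(x, y, c=1):
    """Return x + c*y componentwise."""
    return [a + c * b for a, b in zip(x, y)]

def polarization(x, y):
    """Right-hand side of the polarization identity."""
    return 0.25 * (norm_sq(add(x, y)) - norm_sq(add(x, y, -1))
                   + 1j * norm_sq(add(x, y, 1j))
                   - 1j * norm_sq(add(x, y, -1j)))

def parallelogram_gap(x, y):
    """Difference between the two sides of the parallelogram identity."""
    return (norm_sq(x) + norm_sq(y)
            - 0.5 * (norm_sq(add(x, y)) + norm_sq(add(x, y, -1))))

x = [1 + 2j, -0.5j, 3 + 0j]
y = [0.5 - 1j, 2 + 2j, -1j]
print(inner(x, y), polarization(x, y), parallelogram_gap(x, y))
```

With an inner product that is linear in the second argument instead, the signs of the two imaginary terms in the polarization identity would be exchanged.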
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Schwarz’s Inequality 
Theorem A.0.1 For all x,y € H, 


(x, 91S llell = Ilyll- 


Equality occurs if and only if x and y are colinear. 


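Proof. Say $K = \mathbb{C}$. If $x$ and $y$ are colinear, that is, $x = \lambda y$ for some $\lambda \in \mathbb{C}$, the
inequality is obviously an equality. If $x$ and $y$ are linearly independent, then for
all $\lambda \in \mathbb{C}$, $x + \lambda y \neq 0$. Therefore
$$0 < \|x + \lambda y\|^2 = \|x\|^2 + |\lambda|^2 \|y\|^2 + \lambda^* \langle x, y \rangle + \lambda \langle x, y \rangle^* = \|x\|^2 + |\lambda|^2 \|y\|^2 + 2\,\mathrm{Re}(\lambda^* \langle x, y \rangle) .$$
Take $u \in \mathbb{C}$, $|u| = 1$, such that $u^* \langle x, y \rangle = |\langle x, y \rangle|$. Take any $t \in \mathbb{R}$ and put $\lambda = tu$.
Then
$$0 < \|x\|^2 + t^2 \|y\|^2 + 2t\,|\langle x, y \rangle| .$$
This being true for all $t \in \mathbb{R}$, the discriminant of the second degree polynomial in $t$
of the right-hand side must be strictly negative, that is, $4|\langle x, y \rangle|^2 - 4\|x\|^2 \times \|y\|^2 < 0$.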
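
As an illustration of the discriminant argument, here is a minimal numerical sketch on $\mathbb{C}^3$, again assuming the inner product $\langle x, y \rangle = \sum_k x_k y_k^*$. The vectors and the helper names (`schwarz_gap`, `discriminant`) are chosen for the example only.

```python
import math


def inner(x, y):
    """Inner product on C^n, linear in the first argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))


def norm(x):
    """Norm induced by the inner product."""
    return math.sqrt(inner(x, x).real)


def schwarz_gap(x, y):
    """||x|| * ||y|| - |<x, y>|, non-negative by Schwarz's inequality."""
    return norm(x) * norm(y) - abs(inner(x, y))


def discriminant(x, y):
    """Discriminant of t -> ||x||^2 + 2 t |<x, y>| + t^2 ||y||^2."""
    return 4 * abs(inner(x, y)) ** 2 - 4 * norm(x) ** 2 * norm(y) ** 2


x = [1 + 2j, 0.5j, -1.0]
y = [2.0, 1 - 1j, 3j]
colinear = [(2 - 1j) * a for a in y]  # a complex multiple of y

print(schwarz_gap(x, y), discriminant(x, y))                # positive gap, negative discriminant
print(schwarz_gap(colinear, y), discriminant(colinear, y))  # both zero up to rounding
```

For the linearly independent pair the discriminant is strictly negative, while for the colinear pair it vanishes up to rounding, which corresponds to the equality case of the theorem.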


Theorem A.0.2 The mapping $x \mapsto \|x\|$ is a norm on $H$, that is to say, for all
$x, y \in H$, and all $\alpha \in \mathbb{C}$,


(a) $\|x\| \ge 0$; and $\|x\| = 0$ if and only if $x = 0$,
(b) $\|\alpha x\| = |\alpha| \, \|x\|$, and
(c) $\|x + y\| \le \|x\| + \|y\|$.


Proof. The proof of (a) and (b) is immediate. For (c) write
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2 + \langle x, y \rangle + \langle y, x \rangle$$
and
$$(\|x\| + \|y\|)^2 = \|x\|^2 + \|y\|^2 + 2\|x\| \, \|y\| .$$
It therefore suffices to prove
$$\langle x, y \rangle + \langle y, x \rangle = 2\,\mathrm{Re}(\langle x, y \rangle) \le 2\|x\| \, \|y\| ,$$
which follows from Schwarz's inequality.


The norm $\|\cdot\|$ induces a metric $d(\cdot, \cdot)$ on $H$ by
$$d(x, y) = \|x - y\| .$$

Recall that a mapping $d : E \times E \to \mathbb{R}_+$ is called a metric on $E$ if, for all $x, y, z \in E$,
(a') $d(x, y) \ge 0$; and $d(x, y) = 0$ if and only if $x = y$,
(b') $d(x, y) = d(y, x)$, and
(c') $d(x, y) \le d(x, z) + d(z, y)$.


The above properties are immediate consequences of (a), (b), and (c) of Theorem 
A.0.2. When endowed with a metric, a space H is called a metric space. 


Definition A.0.3 A pre-Hilbert space $H$ is called a Hilbert space if it is a complete
metric space with respect to the metric $d$.


By this, the following is meant: If $\{x_n\}_{n \ge 1}$ is a Cauchy sequence in $H$, that is, if
$\lim_{m, n \uparrow \infty} d(x_m, x_n) = 0$, then there exists an $x \in H$ such that $\lim_{n \uparrow \infty} d(x_n, x) = 0$.


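Theorem A.0.4 Let $\{x_n\}_{n \ge 1}$ and $\{y_n\}_{n \ge 1}$ be sequences in a Hilbert space $H$ that
converge to $x$ and $y$, respectively. Then,
$$\lim_{m, n \uparrow \infty} \langle x_n, y_m \rangle = \langle x, y \rangle .$$


In other words, the inner product of a Hilbert space is bicontinuous. In particular,
the norm $x \mapsto \|x\|$ is a continuous function from $H$ to $\mathbb{R}_+$.


Proof. We have for all $h_1, h_2$ in $H$,
$$|\langle x + h_1, y + h_2 \rangle - \langle x, y \rangle| = |\langle x, h_2 \rangle + \langle h_1, y \rangle + \langle h_1, h_2 \rangle| .$$
By Schwarz's inequality $|\langle x, h_2 \rangle| \le \|x\| \, \|h_2\|$, $|\langle h_1, y \rangle| \le \|y\| \, \|h_1\|$, and $|\langle h_1, h_2 \rangle| \le \|h_1\| \, \|h_2\|$. Therefore
$$\lim_{\|h_1\|, \|h_2\| \downarrow 0} |\langle x + h_1, y + h_2 \rangle - \langle x, y \rangle| = 0 .$$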


Isometric Extension 


Definition A.0.5 Let $H$ and $K$ be two Hilbert spaces with inner products denoted
by $\langle \cdot, \cdot \rangle_H$ and $\langle \cdot, \cdot \rangle_K$, respectively, and let $\varphi : H \to K$ be a linear mapping such
that for all $x, y \in H$
$$\langle \varphi(x), \varphi(y) \rangle_K = \langle x, y \rangle_H .$$
Then, $\varphi$ is called a linear isometry from $H$ into $K$. If, moreover, $\varphi$ is from $H$
onto $K$, then $H$ and $K$ are said to be isomorphic.


Note that a linear isometry is necessarily injective, since $\varphi(x) = \varphi(y)$ implies
$\varphi(x - y) = 0$, and therefore
$$0 = \|\varphi(x - y)\|_K = \|x - y\|_H ,$$
which implies $x = y$. In particular, if the linear isometry is onto, it is bijective.
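
A concrete finite-dimensional instance may help fix ideas. The sketch below assumes $H = K = \mathbb{C}^n$ with the inner product $\sum_k x_k y_k^*$ and takes for $\varphi$ the normalized discrete Fourier transform, which is a standard example of a linear isometry onto. This choice of map and the name `unitary_dft` are illustrative assumptions, not taken from the text.

```python
import cmath


def inner(x, y):
    """Inner product on C^n, linear in the first argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))


def unitary_dft(x):
    """Normalized discrete Fourier transform of x, a linear map from C^n onto C^n."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n)) / n ** 0.5
        for m in range(n)
    ]


x = [1 + 1j, 2.0, -0.5j, 3.0]
y = [0.5, 1j, 1 - 2j, -1.0]

# Isometry: the inner product is preserved.
print(abs(inner(unitary_dft(x), unitary_dft(y)) - inner(x, y)))

# Consequence used for injectivity: ||phi(x - y)|| = ||x - y||.
diff = [a - b for a, b in zip(x, y)]
image = unitary_dft(diff)
print(abs(inner(image, image).real - inner(diff, diff).real))
```

Both printed values are of the order of rounding error. The second check is the computation behind the injectivity remark: the image of $x - y$ has the same norm as $x - y$, so it can vanish only if $x = y$.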


Recall that a subset $A \subseteq E$, where $(E, d)$ is a metric space, is said to be dense
in $E$ if, for all $x \in E$, there exists a sequence $\{x_n\}_{n \ge 1}$ in $A$ converging to $x$.


Theorem A.0.6 Let $H$ and $K$ be two Hilbert spaces with inner products denoted
by $\langle \cdot, \cdot \rangle_H$ and $\langle \cdot, \cdot \rangle_K$, respectively. Let $V$ be a vector subspace of $H$ that is dense in
$H$, and $\varphi : V \to K$ be a linear isometry from $V$ to $K$. Then, there exists a unique
linear isometry $\tilde{\varphi} : H \to K$ whose restriction to $V$ is $\varphi$.


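Proof. We shall first define $\tilde{\varphi}(x)$ for $x \in H$. Since $V$ is dense in $H$, there exists a
sequence $\{x_n\}_{n \ge 1}$ in $V$ converging to $x$. Since $\varphi$ is isometric,
$$\|\varphi(x_n) - \varphi(x_m)\|_K = \|x_n - x_m\|_H \quad \text{for all } m, n \ge 1 .$$
In particular, $\{\varphi(x_n)\}_{n \ge 1}$ is a Cauchy sequence in $K$ and therefore it converges to
some element of $K$, which we denote by $\tilde{\varphi}(x)$.


The definition of $\tilde{\varphi}(x)$ is independent of the sequence $\{x_n\}_{n \ge 1}$ converging to $x$.
Indeed, for another such sequence $\{y_n\}_{n \ge 1}$,
$$\lim_{n \uparrow \infty} \|\varphi(x_n) - \varphi(y_n)\|_K = \lim_{n \uparrow \infty} \|x_n - y_n\|_H = 0 .$$


The mapping $\tilde{\varphi} : H \to K$ so constructed is clearly an extension of $\varphi$ (for $x \in V$
one can take for an approximating sequence of $x$ the sequence $\{x_n\}_{n \ge 1}$ such that
$x_n \equiv x$).

The mapping $\tilde{\varphi}$ is linear. Indeed, let $x, y \in H$, $\alpha, \beta \in \mathbb{C}$, and let $\{x_n\}_{n \ge 1}$
and $\{y_n\}_{n \ge 1}$ be two sequences in $V$ converging to $x$ and $y$, respectively. Then
$\{\alpha x_n + \beta y_n\}_{n \ge 1}$ converges to $\alpha x + \beta y$. Therefore
$$\lim_{n \uparrow \infty} \varphi(\alpha x_n + \beta y_n) = \tilde{\varphi}(\alpha x + \beta y) .$$
But
$$\varphi(\alpha x_n + \beta y_n) = \alpha \varphi(x_n) + \beta \varphi(y_n) \to \alpha \tilde{\varphi}(x) + \beta \tilde{\varphi}(y) ,$$
so that $\tilde{\varphi}(\alpha x + \beta y) = \alpha \tilde{\varphi}(x) + \beta \tilde{\varphi}(y)$.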
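The mapping $\tilde{\varphi}$ is isometric since, in view of the bicontinuity of the inner
product and of the isometricity of $\varphi$, if $\{x_n\}_{n \ge 1}$ and $\{y_n\}_{n \ge 1}$ are two sequences in
$V$ converging to $x$ and $y$, respectively, then
$$\langle \tilde{\varphi}(x), \tilde{\varphi}(y) \rangle_K = \lim_{n \uparrow \infty} \langle \varphi(x_n), \varphi(y_n) \rangle_K = \lim_{n \uparrow \infty} \langle x_n, y_n \rangle_H = \langle x, y \rangle_H .$$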


Orthogonal Projection 


A subset G of a Hilbert space H is said to be closed in H if every convergent 
sequence of G has a limit in G. 


Theorem A.0.7 Let $G \subseteq H$ be a vector subspace of the Hilbert space $H$. Endow
$G$ with the inner product which is the restriction to $G$ of the inner product on $H$.
Then, $G$ is a Hilbert space if and only if $G$ is closed in $H$.


G is then called a Hilbert subspace of H. 


Proof. (i) Assume that $G$ is closed. Let $\{x_n\}_{n \ge 1}$ be a Cauchy sequence in $G$. It
is a fortiori a Cauchy sequence in $H$, and therefore it converges in $H$ to some $x$,
and this $x$ must be in $G$, because it is a limit of elements of $G$ and $G$ is closed.


(ii) Assume that $G$ is a Hilbert space with the inner product induced by the
inner product of $H$. In particular every convergent sequence $\{x_n\}_{n \ge 1}$ of elements
of $G$ converges to some element of $G$. Therefore $G$ is closed.




Definition A.0.8 Two elements $x, y$ of the Hilbert space $H$ are said to be orthogonal
if $\langle x, y \rangle = 0$. Let $G$ be a Hilbert subspace of the Hilbert space $H$. The
orthogonal complement of $G$ in $H$, denoted $G^\perp$, is defined by
$$G^\perp = \{z \in H : \langle z, x \rangle = 0 \text{ for all } x \in G\} .$$


Clearly, $G^\perp$ is a vector space over $\mathbb{C}$. Moreover, it is closed in $H$ since if $\{z_n\}_{n \ge 1}$
is a sequence of elements of $G^\perp$ converging to $z \in H$ then, by continuity of the
inner product,
$$\langle z, x \rangle = \lim_{n \uparrow \infty} \langle z_n, x \rangle = 0 \quad \text{for all } x \in G .$$


Therefore $G^\perp$ is a Hilbert subspace of $H$.


Note that a decomposition $x = y + z$ where $y \in G$ and $z \in G^\perp$ is necessarily
unique. Indeed, let $x = y' + z'$ be another such decomposition. Then, letting
$a = y - y'$, $b = z - z'$, we have that $0 = a + b$ where $a \in G$ and $b \in G^\perp$. Therefore,
in particular, $0 = \langle a, a \rangle + \langle a, b \rangle$. But $\langle a, b \rangle = 0$, and therefore $\langle a, a \rangle = 0$, which
implies that $a = 0$. Similarly, $b = 0$.


Theorem A.0.9 Let $G$ be a Hilbert subspace of $H$. For all $x \in H$, there exists a
unique element $y \in G$ such that $x - y \in G^\perp$. Moreover,
$$\|y - x\| = \inf_{u \in G} \|u - x\| . \qquad \text{(A.1)}$$

Proof. Let $d(x, G) = \inf_{z \in G} d(x, z)$ and let $\{y_n\}_{n \ge 1}$ be a sequence in $G$ such that
$$d(x, G)^2 \le d(x, y_n)^2 \le d(x, G)^2 + \frac{1}{n} . \qquad (\star)$$


The parallelogram identity gives, for all $m, n \ge 1$,
$$\|y_n - y_m\|^2 = 2\left(\|x - y_n\|^2 + \|x - y_m\|^2\right) - 4\left\|x - \tfrac{1}{2}(y_m + y_n)\right\|^2 .$$
Since $\tfrac{1}{2}(y_n + y_m) \in G$,
$$\left\|x - \tfrac{1}{2}(y_m + y_n)\right\|^2 \ge d(x, G)^2 ,$$
and therefore
$$\|y_n - y_m\|^2 \le 2\left(\frac{1}{n} + \frac{1}{m}\right) .$$


The sequence $\{y_n\}_{n \ge 1}$ is therefore a Cauchy sequence in $G$ and consequently it
converges to some $y \in G$ since $G$ is closed. Passing to the limit in $(\star)$ gives (A.1).


Uniqueness of $y$ satisfying (A.1): Let $y' \in G$ be another such element. Then
$$\|x - y'\| = \|x - y\| = d(x, G) ,$$
and from the parallelogram identity
$$\|y - y'\|^2 = 2\|y - x\|^2 + 2\|y' - x\|^2 - 4\left\|x - \tfrac{1}{2}(y + y')\right\|^2 = 4\,d(x, G)^2 - 4\left\|x - \tfrac{1}{2}(y + y')\right\|^2 .$$
Since $\tfrac{1}{2}(y + y') \in G$,
$$\left\|x - \tfrac{1}{2}(y + y')\right\|^2 \ge d(x, G)^2 ,$$
and therefore $\|y - y'\|^2 \le 0$, which implies $\|y - y'\|^2 = 0$ and therefore $y = y'$.


It now remains to show that $x - y$ is orthogonal to $G$, that is, $\langle x - y, z \rangle = 0$
for all $z \in G$. Since this is trivially true if $z = 0$, we may assume $z \neq 0$. Because
$y + \lambda z \in G$ for all $\lambda \in \mathbb{R}$,
$$\|x - (y + \lambda z)\|^2 \ge d(x, G)^2 ,$$
that is,
$$\|x - y\|^2 - 2\lambda\,\mathrm{Re}\{\langle x - y, z \rangle\} + \lambda^2 \|z\|^2 \ge d(x, G)^2 .$$
Since $\|x - y\|^2 = d(x, G)^2$, we have
$$-2\lambda\,\mathrm{Re}\{\langle x - y, z \rangle\} + \lambda^2 \|z\|^2 \ge 0 \quad \text{for all } \lambda \in \mathbb{R} ,$$
which implies $\mathrm{Re}\{\langle x - y, z \rangle\} = 0$. The same type of calculation with $\lambda \in i\mathbb{R}$ (pure
imaginary) leads to $\mathrm{Im}\{\langle x - y, z \rangle\} = 0$. Therefore $\langle x - y, z \rangle = 0$.


That $y$ is the unique element of $G$ such that $y - x \in G^\perp$ follows from the remark
preceding Theorem A.0.9.


Definition A.0.10 The element $y$ in Theorem A.0.9 is called the orthogonal projection
of $x$ on $G$ and is denoted by $P_G(x)$.


The projection theorem states, in particular, that for any $x \in H$ there is a
unique decomposition
$$x = y + z, \quad y \in G, \ z \in G^\perp ,$$
and that $y = P_G(x)$, the (unique) element of $G$ closest to $x$. Therefore


Theorem A.0.11 The orthogonal projection $y = P_G(x)$ is characterized by the
two following properties:


(1) $y \in G$;
(2) $\langle y - x, z \rangle = 0$ for all $z \in G$.
This characterization is known as the projection principle of Hilbert spaces.
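
The projection principle can be illustrated in a finite-dimensional setting. The sketch below assumes $H = \mathbb{R}^3$ with the Euclidean inner product and takes for $G$ the span of two vectors; it computes the projection by Gram-Schmidt orthonormalization, which is one possible method among others. The function names and the numerical data are hypothetical.

```python
def inner(x, y):
    """Euclidean inner product on R^n."""
    return sum(a * b for a, b in zip(x, y))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            c = inner(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        n = inner(w, w) ** 0.5
        if n > tol:  # skip vectors that are already in the span
            basis.append([wi / n for wi in w])
    return basis


def project(x, vectors):
    """Orthogonal projection of x onto the span of `vectors`."""
    y = [0.0] * len(x)
    for e in orthonormal_basis(vectors):
        c = inner(x, e)
        y = [yi + c * ei for yi, ei in zip(y, e)]
    return y


generators = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # G is the span of these two vectors
x = [1.0, 2.0, 6.0]
y = project(x, generators)                        # approximately [2.0, 3.0, 5.0]
residual = [xi - yi for xi, yi in zip(x, y)]      # approximately [-1.0, -1.0, 1.0]

# Property (2): the residual is orthogonal to the generators of G, hence to G.
print([inner(residual, z) for z in generators])
# The projection is at least as close to x as any other element of G, for instance u.
u = [1.0, 1.0, 2.0]
print(inner(residual, residual), inner([a - b for a, b in zip(x, u)], [a - b for a, b in zip(x, u)]))
```

Here orthogonality is only tested against the two generators; by linearity this is enough for the whole of $G$.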


Let $C$ be a collection of vectors in the Hilbert space $H$. The linear span of
$C$, denoted $\mathrm{span}(C)$, is, by definition, the set of all finite linear combinations of
vectors of $C$. This is a vector space. The closure of this vector space, $\overline{\mathrm{span}}(C)$,
is called the Hilbert subspace generated by $C$. By definition, $x$ belongs to this
subspace if and only if there exists a sequence of vectors $\{x_n\}_{n \ge 1}$ such that


(i) for all $n \ge 1$, $x_n$ is a finite linear combination of vectors of $C$, and


(ii) $\lim_{n \uparrow \infty} x_n = x$.


Theorem A.0.12 An element $\hat{x} \in H$ is the projection of $x$ onto $G = \overline{\mathrm{span}}(C)$
if and only if


(a) $\hat{x} \in G$, and


(b) $\langle x - \hat{x}, z \rangle = 0$ for all $z \in C$.


Note that we have to satisfy requirement (b) not for all $z \in G$, but only for all
$z \in C$.


The proof is easy. We have to show that $\langle x - \hat{x}, z \rangle = 0$ for all $z \in G$. But
$z = \lim_{n \uparrow \infty} z_n$, where $\{z_n\}_{n \ge 1}$ is a sequence of vectors of $\mathrm{span}(C)$.
By hypothesis (b) and linearity, for all $n \ge 1$, $\langle x - \hat{x}, z_n \rangle = 0$. Therefore, by continuity of the
inner product,